Abstract

We consider a multi-round auction setting motivated by pay-per-click auctions for Internet advertis- 
ing. In each round the auctioneer selects an advertiser and shows her ad, which is then either clicked 
or not. An advertiser derives value from clicks; the value of a click is her private information. Ini- 
tially, neither the auctioneer nor the advertisers have any information about the likelihood of clicks on 
the advertisements. The auctioneer's goal is to design a (dominant strategies) truthful mechanism that 
(approximately) maximizes the social welfare. 

If the advertisers bid their true private values, our problem is equivalent to the multi-armed bandit 
problem, and thus can be viewed as a strategic version of the latter. In particular, for both problems 
the quality of an algorithm can be characterized by regret, the difference in social welfare between the 
algorithm and the benchmark which always selects the same "best" advertisement. We investigate how 
the design of multi-armed bandit algorithms is affected by the restriction that the resulting mechanism 
must be truthful. We find that truthful mechanisms have certain strong structural properties - essentially, 
they must separate exploration from exploitation - and they incur much higher regret than the optimal 
multi-armed bandit algorithms. Moreover, we provide a truthful mechanism which (essentially) matches 
our lower bound on regret. 

ACM Categories and subject descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: 
Nonnumerical Algorithms and Problems; K.4.4 [Computers and Society]: Electronic Commerce; F.1.2 
[Computation by Abstract Devices]: Modes of Computation — Online computation; J.4 [Social and Be- 
havioral Sciences]: Economics 

General Terms: theory, algorithms, economics. 

Keywords: mechanism design, truthful mechanisms, single-parameter auctions, multi-armed bandit 
problem, regret, online learning. 
1 Introduction



In recent years there has been much interest in understanding the implication of strategic behavior on the 
performance of algorithms whose input is distributed among selfish agents. This study was mainly moti- 
vated by the Internet, the main arena of large scale interaction of agents with conflicting goals. The field 
of Algorithmic Mechanism Design [35] studies the design of mechanisms in computational settings (for 
background see the recent book [36 ] and survey EH). 

Much attention has been drawn to the market for sponsored search (e.g. [28, 19, 39, 32, 31]), a multi-billion
dollar market with numerous auctions running every second. Research on sponsored search mostly focuses on
equilibria of the Generalized Second Price (GSP) auction [19, 39], the auction that is most commonly used
in practice (e.g. by Google and Yahoo), or on the design of truthful auctions [2]. All these auctions rely
on knowing the rates at which users click on the different advertisements (a.k.a. Click-Through-Rates, or 
CTRs), and do not consider the process in which these CTRs are learned or refined over time by observing 
users' behavior. We argue that strategic agents would take this process into account, as it influences their 
utility. Prior work [22] focused on the implication of click fraud on the methods used to learn CTRs. We on 
the other hand are interested in the implications of the strategic bidding by the agents. Thus, we consider 
the problem of designing truthful sponsored search auctions when the process of learning the CTRs is a part 
of the game. 

We are mainly interested in the interplay between the online learning and the strategic aspects of the 
problem. To isolate this issue, we consider the following setting, which is a natural strategic version of 
the multi-armed bandit (MAB) problem. In this setting, there are k agents. Each agent i has a single
advertisement, and a private value $v_i > 0$ for every click she gets. The mechanism is an online algorithm
that first solicits bids from the agents, and then runs for T rounds. In each round the mechanism picks an
agent (using the bids and the clicks observed in the past rounds), displays her advertisement, and receives
feedback: whether there was a click or not. Payments are assigned after round T. Each agent tries to maximize
her own utility: the difference between the value that she derives from clicks and the payment she pays. 
We assume that initially no information is known about the likelihood of each agent to be clicked, and in 
particular there are no Bayesian priors. 
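
As an illustration of the protocol just described, the following is a minimal Python sketch, not the paper's construction. It assumes that click outcomes are given as a table `realization[i][t]` (1 if agent i would be clicked when shown at round t). The names `run_auction`, `AllocationRule` and `round_robin` are hypothetical and introduced only for this example; payments are omitted.

```python
from typing import Callable, List, Sequence, Tuple

# History of (agent shown, click observed) pairs, one entry per past round.
History = List[Tuple[int, int]]
# An allocation rule sees the bids and the observed history, and picks an agent.
AllocationRule = Callable[[Sequence[float], History], int]


def run_auction(bids: Sequence[float],
                realization: Sequence[Sequence[int]],
                allocate: AllocationRule) -> Tuple[History, List[int]]:
    """Run T rounds and return the history and the clicks received by each agent."""
    T = len(realization[0])
    history: History = []
    clicks = [0] * len(bids)
    for t in range(T):
        i = allocate(bids, history)   # may use the bids and past observations only
        click = realization[i][t]     # only the shown agent's bit is revealed
        history.append((i, click))
        clicks[i] += click
    return history, clicks


def round_robin(bids: Sequence[float], history: History) -> int:
    """Hypothetical bid-independent rule: cycle through the agents."""
    return len(history) % len(bids)
```

The key modelling point is that the rule never reads `realization` directly: the click bits of agents that were not shown in a round stay hidden from it.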

We are interested in designing mechanisms which are truthful (in dominant strategies): every agent 
maximizes her utility by bidding truthfully, for any bids of the others and for any clicks that would have 
been received. The goal is to maximize the social welfare.¹ Since the payments cancel out, this is equivalent
to maximizing the total value derived from clicks, where an agent's contribution to that total is her private 
value times the number of clicks she receives. We call this setting the MAB mechanism design problem. 

In the absence of strategic behavior this problem reduces to a standard MAB formulation in which 
an algorithm repeatedly chooses one of the k alternatives ("arms") and observes the associated payoff: 
the value-per-click of the corresponding ad if the ad is clicked, and 0 otherwise. The crucial aspect in
MAB problems is the tradeoff between acquiring more information (exploration) and using the current 
information to choose a good agent (exploitation). MAB problems have been studied intensively for the 
past three decades (see lfT3l [T4l l20l ). In particular, the above formulation is well-understood (6l |7J [GO in 
terms of regret relative to the benchmark which always chooses the same "best" alternative. This notion of 
regret naturally extends to the strategic setting outlined above, the total payoff being exactly equal to the 
social welfare, and the regret being exactly the loss in social welfare. Thus one can directly compare MAB 
algorithms and MAB mechanisms in terms of welfare loss (regret). 

Broadly, we ask how the design of MAB algorithms is affected by the restriction of truthfulness: what 
is the difference between the best algorithms and the best truthful mechanisms? We are interested both in 

1 Social welfare includes both the auctioneer's revenue and the agents' utility. Since in practice different sponsored search plat- 
forms compete against one another, taking into account the agents' utility increases the platform's attractiveness to the advertisers. 
terms of the structural properties and the gap in performance (in terms of regret). We are not aware of any
prior work that characterizes truthful learning algorithms or proves negative results on their performance. 

Our contributions. We present two main contributions. First, we present a characterization of (dominant- 
strategy) truthful mechanisms. Second, we present a lower bound on the regret that such mechanisms must 
suffer. This regret is significantly larger than the regret of the best MAB algorithms. 

Formally, a mechanism for the MAB mechanism design problem is a pair (A,V), where A is the al- 
location rule (essentially, an MAB algorithm), and V is the payment rule. Note that regret is completely 
determined by the allocation rule. As is standard in the literature, we focus on mechanisms in which each 
agent's payment (averaged over clicks) is between 0 and her bid; such mechanisms are called normalized,
and they satisfy voluntary participation. 

The setting we study is a single-parameter auction, the most studied and well-understood type of auction.
For such settings truthful mechanisms are fully characterized [33]: a mechanism is truthful if and
only if the allocation rule is monotone (by increasing her bid an agent cannot cause a decrease in the number 
of clicks she gets), and the payment rule is defined in a specific and (essentially) unique way. Yet, this char- 
acterization is not the right characterization for the MAB setting! The main problem is that in our setting 
click information for any agent that is not chosen at a given round is not available to the mechanism, and 
thus cannot be used in the computation of payments. Thus, the payment cannot depend on any unobserved 
clicks. We show that this has severe implications on the structure of truthful mechanisms. 

The first notable property of a truthful mechanism is a much stronger version of monotonicity:

Definition 1.1. A realization consists of click information for all agents at all rounds (including unobserved 
ones). An allocation rule is pointwise monotone if for each realization, each bid profile and each round, if 
an agent is played at the round, then she is also played after increasing her bid (fixing everything else). 

Let us consider (for the ease of exposition) allocation rules that satisfy the following two natural condi- 
tions. First, an allocation rule is scale-free if it is invariant under multiplying all bids by the same positive 
number (essentially, changing the currency unit). Second, it is Independent of Irrelevant Alternatives (IIA, 
for short) if for any given realization, bid profile and round, a change of bid of agent i cannot transfer the 
allocation in this round from agent j to agent l, where i, j and l are three distinct agents.

We show that any truthful mechanism must have a strict separation between exploration and exploitation. 
A crucial feature of exploration is the ability to influence the allocation in forthcoming rounds. To make this 
point more concrete, we call a round influential for a given realization if for some bid profile changing the 
realization for this round can affect the allocation in some future round. We show that in any such round, 
the allocation cannot depend on the bids. Thus, influential rounds are essentially useless for exploitation.

Definition 1.2. An allocation rule A is called exploration-separated if for any given realization, the alloca- 
tion in any influential round for that realization does not depend on the bids. 

We are now ready to present our main structural result, which is in fact a complete characterization. 

Theorem 1.3. Consider the MAB mechanism design problem. Let A be a non-degenerate² deterministic
allocation rule which is scale-free and satisfies IIA. Then mechanism (A, V) is normalized and truthful for 
some payment rule V if and only if A is pointwise monotone and exploration-separated. 

2 Non-degeneracy is a mild technical assumption, formally defined in "preliminaries", which ensures that (essentially) if a given 
allocation happens for some bid profile $(b_i, b_{-i})$ then the same allocation happens for all bid profiles $(x, b_{-i})$, where x ranges over
some non-degenerate interval. Without this assumption, all structural results hold (essentially) almost surely w.r.t. the k-dimensional
Lebesgue measure on the bid vectors. Exposition becomes significantly more cumbersome, yet leads to the same lower bounds on 
regret. For clarity, we assume non-degeneracy throughout this version of the paper. 
We also obtain a similar (but somewhat more complicated) characterization without assuming that allocations
are scale-free and satisfy IIA (Theorem 3.8). We then use it to derive Theorem 1.3. We emphasize
that our characterization results hold regardless of whether the auctioneer's goal is to maximize welfare or 
revenue or any other objective. 

In view of Theorem 1.3, we present a lower bound on the performance of exploration-separated algorithms.
We consider a setting, termed the stochastic MAB mechanism design problem, in which each click
on a given advertisement is an independent random event which happens with a fixed probability, a.k.a. the
CTR. The expected "payoff" from choosing a given agent is her private value times her CTR. For the ease
of exposition, assume that the bids lie in the interval [0,1]. Then the non-strategic version is the stochastic
MAB problem in which the payoff from choosing a given arm i is an independent sample in [0, 1] with a
fixed mean $\mu_i$. In both versions, regret is defined with respect to a hypothetical allocation rule (resp. algorithm)
that always chooses an arm with the maximal expected payoff. Specifically, regret is the expected
difference between the social welfare (resp. total payoff) of the benchmark and that of the allocation rule 
(resp. algorithm). The goal is to minimize R(T), worst-case regret over all problem instances on T rounds. 
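
The regret definition above can be made concrete with a short Python sketch. It assumes the stochastic setting with truthful bids, and that the expected number of rounds in which each agent is shown is already known; `expected_regret` and its parameter names are hypothetical. Since the click at a round is independent of the allocation decision made at that round, the expected welfare is the sum over agents of (expected plays) times CTR times value.

```python
from typing import Sequence


def expected_regret(mu: Sequence[float], v: Sequence[float],
                    expected_plays: Sequence[float]) -> float:
    """Benchmark welfare minus the rule's expected welfare.

    mu[i] is the CTR of agent i, v[i] her value per click, and
    expected_plays[i] the expected number of rounds in which i is shown.
    """
    T = sum(expected_plays)
    payoffs = [m * x for m, x in zip(mu, v)]   # expected payoff per round: mu_i * v_i
    benchmark = T * max(payoffs)               # always show a best agent
    achieved = sum(n * w for n, w in zip(expected_plays, payoffs))
    return benchmark - achieved
```

For example, with CTRs 0.5 and 0.2, unit values, and 60 and 40 expected plays, the benchmark earns 50 and the rule earns 38, so the regret is 12.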

We show that the worst-case regret of any exploration-separated mechanism is larger than that of the
optimal MAB algorithm [7]: $\Omega(T^{2/3})$ vs. $O(\sqrt{T})$ for a fixed number of agents. We obtain an even more pronounced
difference if we restrict our attention to the $\delta$-gap problem instances: instances for which the best
agent is better than the second-best by a (comparatively large) amount $\delta$, that is $\mu_1 v_1 - \mu_2 v_2 = \delta \cdot (\max_i v_i)$,
where arms are arranged such that $\mu_1 v_1 \ge \mu_2 v_2 \ge \cdots \ge \mu_k v_k$. Such instances are known to be easy for the
MAB algorithms. Namely, an algorithm can achieve the optimal worst-case regret $O(\sqrt{kT \log T})$ and regret
$O(\frac{k}{\delta} \log T)$ on $\delta$-gap instances [29, 6]. However, for exploration-separated mechanisms the worst-case
regret $R_\delta(T)$ over the $\delta$-gap instances is polynomial in $T$ as long as worst-case regret is even remotely non-trivial
(i.e., sublinear). Thus, for the $\delta$-gap instances the gap between algorithms and truthful mechanisms
in the worst-case regret is exponential in $T$.

Theorem 1.4. Consider the stochastic MAB mechanism design problem with k agents. Let A be a deterministic
allocation rule that is exploration-separated. Then A has worst-case regret $R(T) = \Omega(k^{1/3} T^{2/3})$.
Moreover, if $R(T) = O(T^{\gamma})$ for some $\gamma < 1$ then for every fixed $\delta \le \frac{1}{4}$ and $\lambda < 2(1 - \gamma)$ the worst-case
regret over the $\delta$-gap instances is $R_\delta(T) = \Omega(\delta\, T^{\lambda})$.

We note that our lower bounds hold for a more general setting in which the values-per-click can change
over time, and the advertisers are allowed to change their bids at every time step. 

To complete the picture, we present a very simple (deterministic) mechanism that is truthful and normalized,
and matches the lower bound $R(T) = \Omega(k^{1/3} T^{2/3})$ up to logarithmic factors.
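
This section does not spell out the mechanism. As a hedged illustration only, the sketch below shows one natural allocation rule with a strict separation between exploration and exploitation: a bid-independent round-robin phase followed by a commitment to the agent with the largest bid times observed clicks. The function names, the tie-breaking rule and the exploration budget are assumptions made for this example, and the payment rule is not shown.

```python
import math
from typing import List, Sequence


def explore_then_commit(bids: Sequence[float],
                        realization: Sequence[Sequence[int]],
                        rounds_per_agent: int) -> List[int]:
    """Return the agent shown in each round.

    Exploration: show agents in round-robin order, ignoring the bids.
    Exploitation: show the agent maximizing bid * (clicks seen in exploration).
    """
    k, T = len(bids), len(realization[0])
    explore_rounds = min(T, k * rounds_per_agent)
    clicks = [0] * k
    shown: List[int] = []
    for t in range(explore_rounds):
        i = t % k                       # bid-independent choice
        clicks[i] += realization[i][t]
        shown.append(i)
    # Ties are broken toward the lowest index (an arbitrary illustrative choice).
    best = max(range(k), key=lambda i: (bids[i] * clicks[i], -i))
    shown.extend([best] * (T - explore_rounds))   # these clicks are never used again
    return shown


def suggested_rounds_per_agent(k: int, T: int) -> int:
    """Assumed budget of order (T/k)^(2/3) * (log T)^(1/3) rounds per agent."""
    return max(1, math.ceil((T / k) ** (2 / 3) * math.log(T) ** (1 / 3)))
```

With this budget the exploration phase lasts on the order of $k^{1/3} T^{2/3} (\log T)^{1/3}$ rounds, and a standard concentration argument suggests the exploitation phase loses a comparable amount, which is consistent with the bound stated above up to logarithmic factors.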

We also provide a number of extensions. First, we prove a similar (but slightly weaker) regret bound 
without the scale-free assumption. Second, we extend some of our results to randomized mechanisms; in this 
setting, (dominant-strategy) truthfulness means "truthfulness for each realization of the private randomness". 
Third, we consider a weaker notion of truthfulness for randomized mechanisms - for each realization of the 
clicks, but in expectation over the random seed, and use this notion to provide algorithmic results for the 
version of the MAB mechanism design problem in which clicks are chosen by an adversary. Fourth, we 
discuss an even more permissive notion of truthfulness - truthfulness in expectation over the clicks (and the 
random seed). 

Other related work and discussion. The question of how the performance of a truthful mechanism com- 
pares to that of the optimal algorithm for the corresponding non-strategic problem has been considered in the 
literature in a number of other auction settings. Performance gaps have been shown for various scheduling 
problems J31 [35l [TH and for online auction for expiring goods [31]. Other papers presented approximation 
gaps due to computational constraints, e.g. for combinatorial auctions [30, 18] and combinatorial public
projects [37], showing a gap via a structural result for truthful mechanisms. 
The study of MAB mechanisms was initiated by Gonen and Pavlov [21]. The authors present a
MAB mechanism which is claimed to be truthful in a certain approximate sense. Unfortunately, this mechanism
does not satisfy the claimed properties; this was also confirmed with the authors through personal
communication (see also a similar note in [17]).

MAB algorithms were used in the design of Cost-Per-Action sponsored search auctions in Nazerzadeh et
al. [34], where the authors construct a mechanism with approximate properties of truthfulness and individual
rationality. Approximately truthful mechanisms are reasonable assuming the agents would not lie unless it 
leads to significant gains. However, this solution concept is weaker than the exact notion and it may still be 
rational for the agents to deviate (perhaps significantly) from being truthful. Moreover, as truthful bidding is 
not a Nash equilibrium, agents might have an increased incentive to deviate if they speculate that others are 
deviating. All of that may result in unpredictable, and possibly highly suboptimal outcomes. In this paper we 
focus on understanding what can be achieved with the exact truthfulness, mainly proving results of structural 
and lower-bounding nature. We note in passing that providing similar results for the approximately truthful 
setting such as the one in [34] is a worthy and challenging open question.

Independently and concurrently, Devanur and Kakade [17] have studied truthful MAB mechanisms with
focus on maximizing the revenue. They present a lower bound of $\Omega(T^{2/3})$ on the loss in revenue with
respect to the VCG (Vickrey-Clarke-Groves) payment, as well as a truthful mechanism that matches the
lower bound. (This mechanism is almost identical to the one that we present in order to match the lower
bound in Theorem 1.4.)

Our lower bounds use (a novel application of) the relative entropy technique from 1291 1711, see |[26l for 
an account. For other application of this technique, see e.g. lTT6l l23l l27l ITTTl . 

Our work focuses on regret in a prior-free setting in which the algorithm has no prior on CTRs. This is 
in contrast to the recent line of work on dynamic auctions lPT2l l5l which considers fully Bayesian settings in 
which there is a known prior on CTRs, and VCG-like social welfare-maximizing mechanisms are feasible. 
In our prior-free setting VCG-mechanisms cannot be applied as such mechanisms require the allocation to 
exactly maximize the expected social welfare, which is impossible (and not well-defined) without a prior. 

We require the mechanisms to satisfy a strong notion of truthfulness: bidding truthfully is optimal for
every possible realization (and bids of others). This notion is attractive as it does not require the agents to 
be risk neutral. Moreover, it allows for the CTRs to change over time (and still incentivizes agents to be 
truthful). Finally, an agent never regrets in retrospect that she has been truthful. It is desirable to understand 
this notion before moving to weaker notions. 

Map of the paper. Section 2 is preliminaries. Truthfulness characterization is developed and proved in
Section 3. The lower bounds on regret and the simple mechanism that matches them are in Section 4.
Extensions and open questions are in Section 5. To improve the flow of the paper, some of the material is
moved to the appendices.

2 Definitions and preliminaries 

In the MAB mechanism design problem, there is a set K of k agents numbered from 1 to k. Each agent 
i has a value $v_i > 0$ for every click she gets; this value is known only to agent i. Initially, each agent i
submits a bid $b_i > 0$, possibly different from $v_i$.³ The "game" lasts for T rounds, where T is the given time
horizon. A realization represents the click information for all agents and all rounds. Formally, it is a tuple 

3 One can also consider a more realistic and general model in which the value-per-click of an agent changes over time and the 
agents are allowed to change their bid at every round. The case that the value-per-click of each agent does not change over time 
is a special case. In that case truthfulness implies that each agent basically submits one bid as in our model (the same bid at every 
round), thus our main results (necessary conditions for truthfulness and regret lower bounds) also hold for the more general model. 
$p = (p_1, \ldots, p_k)$ such that for every agent i and round t, the bit $p_i(t) \in \{0, 1\}$ indicates whether i gets
a click if played at round t. An instance of the MAB mechanism design problem consists of the number
of agents k, time horizon T, a vector of private values $v = (v_1, \ldots, v_k)$, a vector of bids (bid profile)
$b = (b_1, \ldots, b_k)$, and realization p.

A mechanism is a pair (A, V), where A is the allocation rule and V is the payment rule. An allocation
rule is represented by a function A that maps bid profile b, realization p and a round t to the agent i that is
chosen (receives an impression) in this round: $A(b; p; t) = i$. We also denote $A_i(b; p; t) = \mathbf{1}_{\{A(b; p; t) = i\}}$.
The allocation is online in the sense that at each round it can only depend on clicks observed prior to that
round. Moreover, it does not know the realization in advance; in every round it only observes the realization
for the agent that is shown in that round. A payment rule is a tuple $V = (V_1, \ldots, V_k)$, where $V_i(b; p) \in \mathbb{R}$
denotes the payment charged to agent i when the bids are b and the realization is p.⁴ Again, the payment can
only depend on observed clicks. A mechanism is called normalized if for any agent i, bids b and realization
p it holds that $V_i(b; p)$ is non-negative and at most $b_i$ times the number of clicks agent i got.

For given realization p and bid profile b, the number of clicks received by agent i is denoted $C_i(b; p)$. Call
$C = (C_1, \ldots, C_k)$ the click-allocation for A. The utility that agent i with value $v_i$ gets from the mechanism
(A, V) when the bids are b and the realization is p is $U_i(v_i; b; p) = v_i \cdot C_i(b; p) - V_i(b; p)$ (quasi-linear
utility). The mechanism is truthful if for any agent i, value $v_i$, bid profile b and realization p it is the case
that $U_i(v_i; v_i, b_{-i}; p) \ge U_i(v_i; b_i, b_{-i}; p)$.
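
To make the definitions of utility and truthfulness concrete, here is a minimal Python sketch on a toy one-round mechanism. The mechanism itself (show the highest bidder and charge her the second-highest bid per click) is a hypothetical example chosen for illustration and is not taken from the paper; the function names are likewise assumptions. The check only tries deviations from a finite grid, so it gives evidence of truthfulness rather than a proof.

```python
from typing import List, Sequence, Tuple


def second_price_round(bids: Sequence[float],
                       click_bits: Sequence[int]) -> Tuple[List[int], List[float]]:
    """Hypothetical one-round mechanism for at least two agents.

    Shows the highest bidder and charges her the second-highest bid if she is clicked.
    Returns (clicks per agent, payment per agent).
    """
    k = len(bids)
    winner = max(range(k), key=lambda i: (bids[i], -i))   # ties go to the lowest index
    second = max(bids[j] for j in range(k) if j != winner)
    clicks = [0] * k
    payments = [0.0] * k
    clicks[winner] = click_bits[winner]                   # only this bit is observed
    payments[winner] = second * clicks[winner]
    return clicks, payments


def utility(i: int, value: float, bids: Sequence[float],
            click_bits: Sequence[int]) -> float:
    """Quasi-linear utility U_i = v_i * C_i - V_i of agent i."""
    clicks, payments = second_price_round(bids, click_bits)
    return value * clicks[i] - payments[i]


def is_truthful_on_grid(i: int, value: float, others: Sequence[float],
                        click_bits: Sequence[int], grid: Sequence[float]) -> bool:
    """Check U_i(v; v, b_-i) >= U_i(v; x, b_-i) for every deviation x in grid."""
    def with_bid(x: float) -> List[float]:
        return list(others[:i]) + [x] + list(others[i:])
    truthful = utility(i, value, with_bid(value), click_bits)
    return all(truthful >= utility(i, value, with_bid(x), click_bits) - 1e-12
               for x in grid)
```

Here `others` lists the bids of the remaining agents in index order, and the payment is at most the winner's bid times her clicks, so this toy mechanism is also normalized.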

In the stochastic MAB mechanism design problem, an adversary specifies a vector $\mu = (\mu_1, \ldots, \mu_k)$
of CTRs (concealed from A), then for each agent i and round t, realization $p_i(t)$ is chosen independently
with mean $\mu_i$. Thus, an instance of the problem includes $\mu$ rather than a fixed realization. For a given
problem instance $\mathcal{I}$, let $i^* \in \operatorname{argmax}_i \mu_i v_i$; then regret on this instance is defined as

$$R^{\mathcal{I}}(T) = T\, \mu_{i^*} v_{i^*} - \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{i=1}^{k} \mu_i v_i\, A_i(b; p; t) \right].$$

For a given parameter $v_{\max}$, the worst-case regret⁵ $R(T; v_{\max})$ denotes the supremum of $R^{\mathcal{I}}(T)$ over all
problem instances $\mathcal{I}$ in which all private values are at most $v_{\max}$. Similarly, we define $R_\delta(T; v_{\max})$, the
worst-case $\delta$-regret, by taking the supremum only on instances with $\delta$-gap.

Most of our results are stated for non-degenerate allocation rules, defined as follows. An interval is
called non-degenerate if it has positive length. Fix bid profile b, realization p, and rounds t and t' with
$t < t'$. Let $i = A(b; p; t)$ and let p' be the realization obtained from p by flipping the bit $p_i(t)$. An allocation
rule A is non-degenerate w.r.t. $(b, p, t, t')$ if there exists a non-degenerate interval I containing $b_i$ such that

$$A_i(x, b_{-i}; \varphi; s) = A_i(b; \varphi; s) \quad \text{for each } \varphi \in \{p, p'\}, \text{ each } s \in \{t, t'\}, \text{ and all } x \in I.$$

An allocation rule is non-degenerate if it is non-degenerate w.r.t. each tuple $(b, p, t, t')$.

3 Truthfulness characterization

Before presenting our characterization we begin by describing some related background. The click allocation
C is non-decreasing if for each agent i, increasing her bid (and keeping everything else fixed) does
not decrease $C_i$. Prior work has established a characterization of truthful mechanisms for single-parameter
domains (domains in which the private information of each agent is one-dimensional), relating click allocation
monotonicity and truthfulness (see below). For our problem, this result is a characterization of MAB

4 We allow the mechanism to determine the payments at the end of the T rounds, and not after every round. This makes the task
of designing a truthful mechanism easier and thus strengthens our necessary condition for truthfulness (the condition used to derive
the lower bounds on regret).

5 By abuse of notation, when clear from the context, the "worst-case regret" is sometimes simply called "regret".

algorithms that are truthful for a given realization p, assuming that the entire realization p can be used to
compute payments (when computing payments one can use click information for every round and every 
agent, even if the agent was not shown at that round.) One of our main contributions is a characterization 
of MAB allocation rules that can be truthfully implemented when payment computation is restricted to only 
use click information of the actual impressions assigned by the allocation rule.

An MAB allocation rule A is truthful with unrestricted payment computation if it is truthful with a
payment rule that can use the entire realization p in its computation. We next present the prior result characterizing
truthful mechanisms with unrestricted payment computation.

Theorem 3.1 (Myerson [33], Archer and Tardos [4]). Let (A, V) be a normalized mechanism for the MAB
mechanism design problem. It is truthful with unrestricted payment computation if and only if for any given
realization p the corresponding click-allocation C is non-decreasing and the payment rule is given by

$$V_i(b_i, b_{-i}; p) = b_i \cdot C_i(b_i, b_{-i}; p) - \int_0^{b_i} C_i(x, b_{-i}; p)\, dx. \qquad (3.1)$$
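
For a fixed realization and fixed bids of the others, the click-allocation of a deterministic rule is a non-decreasing step function of the agent's own bid, so the payment in (3.1) can be evaluated exactly. The Python sketch below illustrates this under the assumption that the step function is given as a list of (threshold, gain) pairs; this representation and the function names are hypothetical.

```python
from typing import List, Tuple

# A non-decreasing step function: the agent gains `gain` clicks once her bid
# exceeds `threshold` (with the other bids and the realization held fixed).
Steps = List[Tuple[float, int]]


def clicks_at(steps: Steps, bid: float) -> int:
    """Number of clicks C_i(bid, b_-i; p) for the given step function."""
    return sum(gain for threshold, gain in steps if bid > threshold)


def integral_of_clicks(steps: Steps, bid: float) -> float:
    """Exact integral of C_i(x, b_-i; p) for x from 0 to bid."""
    return sum(gain * (bid - threshold) for threshold, gain in steps if bid > threshold)


def myerson_payment(steps: Steps, bid: float) -> float:
    """Payment from (3.1): bid * C_i(bid) minus the area under C_i up to bid."""
    return bid * clicks_at(steps, bid) - integral_of_clicks(steps, bid)
```

In this representation the payment simplifies to the sum of threshold times gain over the thresholds below the bid: each click is charged the smallest bid at which it would still have been won. Evaluating it requires the whole curve $C_i(\cdot, b_{-i}; p)$, including clicks at bids the agent did not submit.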

We can now move to characterize truthful MAB mechanisms when the payment computation is restricted.
The following notation will be useful: for a given realization p, let $p \oplus \mathbf{1}(i, t)$ be the realization
that coincides with p everywhere, except that the bit $p_i(t)$ is flipped.

The first notable property of truthful mechanisms is a stronger version of monotonicity. Recall (see
Definition 1.1) that an allocation rule A is pointwise monotone if for each realization p, bid profile b, round
t and agent i, if $A_i(b_i, b_{-i}; p; t) = 1$ then $A_i(b_i^+, b_{-i}; p; t) = 1$ for any $b_i^+ > b_i$. In words, increasing a bid
cannot cause a loss of an impression.
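
On very small instances, pointwise monotonicity can be checked by brute force. The Python sketch below is an illustration under stated assumptions: an allocation rule is modelled as a function that returns the whole sequence of shown agents for given bids and realization, bids are restricted to a finite grid, and the two example rules are hypothetical. A result of `True` therefore only certifies monotonicity on the grid.

```python
from itertools import product
from typing import Callable, List, Sequence

Allocation = Callable[[Sequence[float], Sequence[Sequence[int]]], List[int]]


def is_pointwise_monotone(allocate: Allocation, k: int, T: int,
                          grid: Sequence[float]) -> bool:
    """Check all realizations and all bid profiles drawn from `grid`."""
    for bits in product([0, 1], repeat=k * T):
        realization = [bits[i * T:(i + 1) * T] for i in range(k)]
        for bids in product(grid, repeat=k):
            shown = allocate(bids, realization)
            for i in range(k):
                for higher in (x for x in grid if x > bids[i]):
                    raised = bids[:i] + (higher,) + bids[i + 1:]
                    shown_raised = allocate(raised, realization)
                    # Every round won at the lower bid must still be won at the higher bid.
                    if any(a == i and b != i for a, b in zip(shown, shown_raised)):
                        return False
    return True


def highest_bid_every_round(bids, realization):
    """Hypothetical rule: always show the highest bidder (ties to the lowest index)."""
    k, T = len(bids), len(realization[0])
    return [max(range(k), key=lambda i: (bids[i], -i))] * T


def lowest_bid_every_round(bids, realization):
    """Hypothetical rule that violates monotonicity: always show the lowest bidder."""
    k, T = len(bids), len(realization[0])
    return [min(range(k), key=lambda i: (bids[i], i))] * T
```

The check is per round and per realization, which is what makes the property stronger than monotonicity of the total click count.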

Lemma 3.2. Consider the MAB mechanism design problem. Let (A, V) be a normalized truthful mechanism 
such that A is a non-degenerate deterministic allocation rule. Then A is pointwise-monotone. 

Proof. For a contradiction, assume not. Then there is a realization p, a bid profile b, a round t and agent
i such that agent i loses an impression in round t by increasing her bid from $b_i$ to some larger value $b_i^+$.
In other words, we have $A_i(b_i^+, b_{-i}; p; t) < A_i(b_i, b_{-i}; p; t)$. Without loss of generality, let us assume that
there are no clicks after round t, that is $p_j(t') = 0$ for any agent j and any round $t' > t$ (since changes in p
after round t do not affect anything up to and including round t).

Let $p' = p \oplus \mathbf{1}(i, t)$. The allocation in round t cannot depend on this bit, so it must be the same for
both realizations. Now, for each realization $\varphi \in \{p, p'\}$ the mechanism must be able to compute the price
for agent i when bids are $(b_i^+, b_{-i})$. That involves computing the integral $I_i(\varphi) = \int_{x \le b_i^+} C_i(x, b_{-i}; \varphi)\, dx$
from (3.1). We claim that $I_i(p) \ne I_i(p')$. However, the mechanism cannot distinguish between p and p'
since they only differ in bit (i, t) and agent i does not get an impression in round t. This is a contradiction.

It remains to prove the claim. Without loss of generality, assume that $p_i(t) = 0$ (otherwise interchange
the role of p and p'). We first note that $C_i(x, b_{-i}; p) \le C_i(x, b_{-i}; p')$ for every x. This is because everything
is the same in p and p' until round t (so the impressions are the same too), there are no clicks after round t, and in
round t the behavior of A on the two realizations can be different only if agent i gets an impression, in
which case she is clicked under p' and not clicked under p.

Since A is non-degenerate, there exists a non-degenerate interval I containing $b_i$ such that changing the bid
of agent i to any value in this interval does not change the allocation at round t (both for p and for p'). For
any $x \in I$ we have $C_i(x, b_{-i}; p) < C_i(x, b_{-i}; p')$, where the difference is due to the click in round t. It
follows that $I_i(p) < I_i(p')$. Claim proved. Hence, the mechanism cannot be implemented truthfully. □

Recall (see Definition 1.2) that round t is influential for a given realization p if for some bid profile b
there exists a round $t' > t$ such that $A(b; p; t') \ne A(b; p \oplus \mathbf{1}(j, t); t')$ for $j = A(b; p; t)$. In words: changing
the relevant part of the realization at round t affects the allocation in some future round t'. An allocation
rule A is called exploration-separated if for any given realization p and round t that is influential for p, it
holds that $A(b; p; t) = A(b'; p; t)$ for any two bid vectors b, b' (allocation at t does not depend on the bids).
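
The two definitions above translate directly into a brute-force check on tiny instances. The Python sketch below is illustrative only: allocation rules are modelled as functions returning the full sequence of shown agents, bids range over a finite grid, and both example rules (`greedy` and `explore_then_commit`) are hypothetical.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Realization = Tuple[Tuple[int, ...], ...]
Allocation = Callable[[Sequence[float], Realization], List[int]]


def flip(realization: Realization, j: int, t: int) -> Realization:
    """The realization p (+) 1(j, t): equal to p except that bit p_j(t) is flipped."""
    rows = [list(row) for row in realization]
    rows[j][t] ^= 1
    return tuple(tuple(row) for row in rows)


def is_influential(allocate: Allocation, realization: Realization, t: int,
                   bid_profiles: Sequence[Sequence[float]]) -> bool:
    """True if, for some bid profile, flipping the bit observed at round t
    changes the allocation in some later round."""
    for bids in bid_profiles:
        shown = allocate(bids, realization)
        shown_flipped = allocate(bids, flip(realization, shown[t], t))
        if shown[t + 1:] != shown_flipped[t + 1:]:
            return True
    return False


def is_exploration_separated(allocate: Allocation, k: int, T: int,
                             grid: Sequence[float]) -> bool:
    bid_profiles = list(product(grid, repeat=k))
    for bits in product([0, 1], repeat=k * T):
        realization = tuple(tuple(bits[i * T:(i + 1) * T]) for i in range(k))
        for t in range(T):
            if is_influential(allocate, realization, t, bid_profiles):
                chosen = {allocate(bids, realization)[t] for bids in bid_profiles}
                if len(chosen) > 1:    # an influential round whose allocation depends on bids
                    return False
    return True


def greedy(bids, realization):
    """Hypothetical rule mixing exploration and exploitation: show the agent with
    the highest bid * (clicks + 1) / (plays + 1), ties to the lowest index."""
    k, T = len(bids), len(realization[0])
    clicks, plays, shown = [0] * k, [0] * k, []
    for t in range(T):
        i = max(range(k), key=lambda a: (bids[a] * (clicks[a] + 1) / (plays[a] + 1), -a))
        clicks[i] += realization[i][t]
        plays[i] += 1
        shown.append(i)
    return shown


def explore_then_commit(bids, realization):
    """Hypothetical separated rule (needs T >= k): one round-robin pass,
    then commit to the agent with the best bid * clicks."""
    k, T = len(bids), len(realization[0])
    clicks = [realization[i][i] for i in range(k)]   # agent i is shown in round i
    best = max(range(k), key=lambda a: (bids[a] * clicks[a], -a))
    return list(range(k)) + [best] * (T - k)
```

In the greedy rule the very first round both depends on the bids and feeds later decisions, so it fails the check; in the second rule the only bid-dependent rounds come after all learning has stopped.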

The main structural implication is "truthful implies exploration-separated". To illustrate the ideas behind 
this implication, we first state and prove it for two agents. 

Proposition 3.3. Consider the MAB mechanism design problem with two agents. Let A be a non-degenerate
scale-free deterministic allocation rule. If (A, V) is a normalized truthful mechanism for some V, then A is
exploration-separated.

Proof. Assume A is not exploration-separated. Then there is a counterexample (p, t): a realization p and a
round t such that round t is influential and allocation in round t depends on bids. We want to prove that this
leads to a contradiction.

Let us pick a counterexample (p, t) with some useful properties. Since round t is influential, there exists
a realization p and bid profile b such that the allocation at some round $t' > t$ (the influenced round) is
different under realization p and another realization $p' = p \oplus \mathbf{1}(j, t)$, where $j = A(b; p; t)$ is the agent
chosen at round t under p. Without loss of generality, let us pick a counterexample with minimum value
of t' over all choices of (b, p, t). For ease of exposition, from this point on let us assume that $j = 2$. For
the counterexample we can also assume that $p_1(t') = 1$, and that there are no clicks after round t', that is
$p_l(t'') = p'_l(t'') = 0$ for all $t'' > t'$ and for all $l \in \{1, 2\}$.

We know that the allocation in round t depends on bids. This means that agent 1 gets an impression in
round t for some bid profile $\hat{b} = (\hat{b}_1, \hat{b}_2)$ under realization p, that is $A(\hat{b}; p; t) = 1$. As the mechanism is
scale-free this means that, denoting $b_1^+ = \hat{b}_1 b_2 / \hat{b}_2$, we have $A(b_1^+, b_2; p; t) = 1$. Since $A(b_1, b_2; p; t) = 2$
and $A(b_1^+, b_2; p; t) = 1$, pointwise monotonicity (Lemma 3.2) implies that $b_1^+ > b_1$. We conclude that there
exists a bid $b_1^+ > b_1$ for agent 1 such that $A(b_1^+, b_2; p; t) = 1$.

Now, the mechanism needs to compute prices for agent 1 for bids $(b_1^+, b_2)$ under realizations p and p',
that is $V_1(b_1^+, b_2; p)$ and $V_1(b_1^+, b_2; p')$. Therefore, the mechanism needs to compute the integral
$I_1(\varphi) = \int_{x \le b_1^+} C_1(x, b_2; \varphi)\, dx$ for both realizations $\varphi \in \{p, p'\}$.

First of all, for all $x \le b_1^+$ and for all $t'' < t'$, $A(x, b_2; p; t'') = A(x, b_2; p'; t'')$, since otherwise the
minimality of t' will be violated. The only difference in the allocation can occur in round t'.

Let us assume $A_1(b_1, b_2; p; t') < A_1(b_1, b_2; p'; t')$ (otherwise, we can swap p and p'). We make the
claim that for all bids $x \le b_1^+$ of agent 1, the influence of round t on round t' is in the same "direction":

$$A_1(x, b_2; p; t') \le A_1(x, b_2; p'; t') \quad \text{for all } x \le b_1^+. \qquad (3.2)$$

Suppose (3.2) does not hold. Then there is an $x \le b_1^+$ such that $1 = A_1(x, b_2; p; t') > A_1(x, b_2; p'; t') = 0$.
(Note that we have used the fact that the mechanism is deterministic.) If $x < b_1$ then pointwise monotonicity
is violated under realization p, since $A_1(x, b_2; p; t') > A_1(b_1, b_2; p; t')$; otherwise it is violated under
realization p', giving a contradiction in both cases. The claim (3.2) follows.

Since A is non-degenerate, there exists a non-degenerate interval I containing $b_1$ such that if agent 1
bids any value $x \in I$ then $A_1(x, b_2; p; t') < A_1(x, b_2; p'; t')$. Now by (3.2) it follows that $I_1(p) < I_1(p')$.
However, the mechanism cannot distinguish between p and p' when the bid of agent 1 is $b_1^+$, since the
differing bit $p_2(t)$ is not observed. Therefore the mechanism cannot compute prices, contradiction. □

3.1 General Truthfulness Characterization 

Let us develop the general truthfulness characterization that does not assume that an allocation is scale-free
and IIA. We will later use it to derive Theorem 1.3.

Definition 3.4. Fix realization $\rho$ and bid vector $b$. A round $t$ is called $(b; \rho)$-secured from agent $i$ if
$A(b_i^+, b_{-i}; \rho; t) = A(b_i, b_{-i}; \rho; t)$ for any $b_i^+ > b_i$. A round $t$ is called bid-independent w.r.t. $\rho$ if the
allocation $A(b; \rho; t)$ is a constant function of $b$.
The following definitions elaborate on the notion of an influential round.

Definition 3.5. A round $t$ is called $(b; \rho)$-influential, for bid profile $b$ and realization $\rho$, if for some round
$t' > t$ it holds that $A(b; \rho; t') \ne A(b; \rho'; t')$ for realization $\rho' = \rho \oplus \mathbf{1}(j,t)$ such that $j = A(b; \rho; t)$. In this
case, $t'$ is called the influenced round and $j$ is called the influencing agent of round $t$. The agent $i$ is called
an influenced agent of round $t$ if $i \in \{A(b; \rho; t'), A(b; \rho'; t')\}$.
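The definition of an influential round can be checked mechanically for small instances. The following is a minimal sketch, assuming a deterministic allocation rule is given as a function of the bids and of the history of observed (agent, click) pairs. The names `run`, `influenced_rounds` and `toy_rule`, the 0-indexed rounds and agents, and the toy rule itself are illustrative assumptions, not part of the paper.

```python
from typing import Callable, List, Sequence, Tuple

History = List[Tuple[int, int]]  # (agent shown, click bit observed) for each past round
Rule = Callable[[Sequence[float], History, int], int]

def run(rule: Rule, bids, rho, T) -> List[int]:
    """Run a deterministic allocation rule; rho[i][t] is the click bit of agent i at round t."""
    history: History = []
    shown = []
    for t in range(T):
        i = rule(bids, history, t)
        shown.append(i)
        history.append((i, rho[i][t]))  # only the bit of the shown agent is revealed
    return shown

def influenced_rounds(rule: Rule, bids, rho, t, T) -> List[int]:
    """Rounds t' > t whose allocation changes when the bit observed at round t is flipped."""
    base = run(rule, bids, rho, T)
    j = base[t]                       # agent chosen at round t under rho
    flipped = [list(row) for row in rho]
    flipped[j][t] ^= 1                # rho' = rho XOR 1(j, t)
    alt = run(rule, bids, flipped, T)
    return [s for s in range(t + 1, T) if base[s] != alt[s]]

# Hypothetical toy rule for two agents: round 0 always shows agent 0; later rounds
# show agent 0 only if she was clicked in round 0 and bids at least as much as agent 1.
def toy_rule(bids, history, t):
    if t == 0:
        return 0
    clicked = history[0][1] == 1
    return 0 if (clicked and bids[0] >= bids[1]) else 1

rho = [[1, 0, 0], [0, 0, 0]]
print(influenced_rounds(toy_rule, [2.0, 1.0], rho, t=0, T=3))
print(influenced_rounds(toy_rule, [1.0, 2.0], rho, t=0, T=3))
```

A non-empty result means that round `t` is influential for that bid profile and realization. In the toy rule, round 0 is influential for bids (2, 1) but not for bids (1, 2), which illustrates why the definition is stated per bid profile.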

Note that a round is influential w.r.t. realization $\rho$ if and only if it is $(b, \rho)$-influential for some $b$. The
central property in our characterization is that each $(b, \rho)$-influential round is $(b, \rho)$-secured.

Definition 3.6. A deterministic allocation is called weakly separated if for every realization $\rho$ and each bid
vector $b$, it holds that if round $t$ is $(b; \rho)$-influential with influenced agent $i$ then it is $(b; \rho)$-secured from $i$.

We notice that exploration-separated is a stronger notion. 

Observation 3.7. For a deterministic allocation, exploration-separated implies weakly separated.

We are now ready to state our general characterization. 

Theorem 3.8. Consider the MAB mechanism design problem. Let A be a non-degenerate deterministic
allocation rule. Then mechanism (A, V) is normalized and truthful for some payment rule V if and only if
A is pointwise monotone and weakly separated.

Proof. For the "only if" direction, A is pointwise monotone by Lemma 3.2, and the fact that A is weakly
separated is proved similarly to Proposition 3.3 (albeit with a few extra details). We defer it to Appendix A.

We focus on the "if" direction. Let A be a deterministic allocation rule which is pointwise monotone 
and weakly separated. We need to provide a payment rule V such that the resulting mechanism (A, V) is 
truthful and normalized. Since A is pointwise monotone, it immediately follows that it is monotone (i.e., as 
an agent increases her bid, the number of clicks that she gets cannot decrease). Therefore it follows from 
Theorem 3.1 that mechanism (A, V) is truthful and normalized if and only if V is given by (3.1). We need
to show that V can be computed using only the knowledge of the clicks (bits from the realization) that were 
revealed during the execution of A. 

Assume we want to compute the payment for agent $i$ in bid profile $(b_i, b_{-i})$ and realization $\rho$. We will
prove that we can compute $C_i(x) := C_i(x, b_{-i}; \rho)$ for all $x \le b_i$. To compute $C_i(x)$, we show that it is
possible to simulate the execution of the mechanism with $\mathrm{bid}_i = x$. In some rounds agent $i$ loses
an impression, and in others she retains the impression (pointwise monotonicity ensures that agent $i$ cannot
gain an impression when decreasing her bid). In the rounds in which she loses an impression, the mechanism does
not observe the bits of $\rho$ in those rounds, so we prove that those bits are irrelevant while computing $C_i(x)$.
In other words, while running with $\mathrm{bid}_i = x$, if the mechanism needs to observe a bit that was not revealed
when running with $\mathrm{bid}_i = b_i$, we arbitrarily set that bit equal to 1 and simulate the execution of $A$. We
want to prove that this computes $C_i(x)$ correctly.

Let $t_1 < t_2 < \cdots < t_n$ be the rounds in which agent $i$ did not get an impression while bidding $x$, but
did get an impression while bidding $b_i$. Let $\rho^0 := \rho$, and let us define realization $\rho^l$ inductively for every
$l \in [n]$ by setting $\rho^l := \rho^{l-1} \oplus \mathbf{1}(j_l, t_l)$, where $j_l = A(x, b_{-i}; \rho^{l-1}; t_l)$ is the agent that got the impression
at round $t_l$ with realization $\rho^{l-1}$ and bids $(x, b_{-i})$.

First, we claim that $j_l \ne i$ for any $l$. Indeed, suppose not, and pick the smallest $l$ such that $j_{l+1} = i$.
Then $t_l$ is a $(x, b_{-i}; \rho^l)$-influential round, with influenced agent $j_{l+1} = i$. Thus $t_l$ is $(x, b_{-i}; \rho^l)$-secured

6 Note that realizations p and p' are interchangeable. 

7 To see this, simply use the definitions. Fix realization $\rho$ and bid vector $b$, let $t$ be a $(b; \rho)$-influential round with influenced agent
$i$. We need to show that $t$ is $(b; \rho)$-secured from $i$. Round $t$ is $(b; \rho)$-influential, thus influential w.r.t. $\rho$, thus (since the allocation is
exploration-separated) it is bid-independent w.r.t. $\rho$, thus agent $i$ cannot change the allocation in round $t$ by increasing her bid.
from $i$. Since $A(x, b_{-i}; \rho^{l}; t_l) = A(x, b_{-i}; \rho^{l-1}; t_l) = j_l \ne i$ by minimality of $l$, agent $i$ does not get
an impression in round $t_l$ if she raises her bid to $b_i$. That is, $A(b; \rho^l; t_l) \ne i$. However, the changes in
realizations $\rho^0, \ldots, \rho^{l-1}$ only concern the rounds in which agent $i$ is chosen, so they are not seen by the
allocation if the bid profile is $b$ (to prove this formally, use induction). Thus, $A(b; \rho^l; t_l) = A(b; \rho; t_l) = i$,
contradiction. Claim proved. It follows that $A(b; \rho^l; t_l) = i$ for each $l$. (This is because by induction, the
change from $\rho^{l-1}$ to $\rho^l$ is not seen by the allocation if the bid profile is $b$.)

We claim that $A_i(x, b_{-i}; \rho; t') = A_i(x, b_{-i}; \rho^n; t')$ for every round $t'$, which will prove the theorem. If
not, then there exists $l$ such that $A_i(x, b_{-i}; \rho^l; t') \ne A_i(x, b_{-i}; \rho^{l-1}; t')$ for some $t'$ (and of course $t' > t_l$).
Round $t_l$ is thus $(x, b_{-i}; \rho^l)$-influential with influenced round $t'$ and influenced agent $i$. Moreover, the
influencing agent of that round is $j_l$, and we already proved that $j_l \ne i$. Since round $t_l$ is $(x, b_{-i}; \rho^l)$-secured
from agent $i$ due to the "weakly separated" condition, it follows that agent $i$ does not get an impression in
round $t_l$ if she raises her bid to $b_i$. That is, $A(b; \rho^l; t_l) \ne i$, contradiction. □

Note that we have proven the main characterization (Theorem 1.3) for the case of two agents, because
for two agents IIA trivially holds and in the scale-free case, an allocation is exploration-separated if and only
if it is weakly separated.

Let us argue that the non-degeneracy assumption in Theorem 3.8 is indeed necessary. To this end, let
us present a simple deterministic mechanism $(A, V)$ for two agents that is truthful and normalized, such
that the allocation rule $A$ is pointwise monotone, scale-free and yet not weakly separated. (The catch is, of
course, that it is degenerate.) There are only two rounds. Agent 1 is allocated at round 1 if and only if $b_1 \ge b_2$.
Agent 1 is allocated at round 2 if $b_1 > b_2$, or if $b_1 = b_2$ and $\rho_1(1) = 1$; otherwise agent 2 is shown. This
completes the description of the allocation rule. To obtain a payment rule $V$ which makes the mechanism
normalized and truthful, consider an alternate allocation rule $A'$ which in each round selects agent 1 if and
only if $b_1 \ge b_2$. (Note that $A' = A$ except when $b_1 = b_2$.) Use Theorem 3.8 for $A'$ to obtain a normalized
truthful mechanism $(A', V')$, and set $V = V'$. The payment rule $V$ is well-defined since the observed clicks
for $V$ and $V'$ coincide unless $b_1 = b_2$, in which case both payment rules charge 0 to both players. The
resulting mechanism $(A, V)$ is normalized and truthful because the integral in (3.1) remains the same even
if we change the value at a single point. It is easy to see that the allocation rule $A$ has all the claimed
properties; it fails to be non-degenerate because round 1 is influential only when $b_1 = b_2$.

3.2 Scalefree and IIA allocation rules 

To complete the proof of Theorem 1.3, we show that under the right assumptions, an allocation is
exploration-separated if and only if it is weakly separated. The full proof of this result is in Appendix A.

Lemma 3.9. Consider the MAB mechanism design problem. Let A be a non-degenerate deterministic 
allocation rule which is scalefree, pointwise monotone, and satisfies IIA. Then it is exploration-separated if 
and only if it is weakly separated. 

Proof Sketch. We sketch the proof of Lemma 3.9 at a very high level. The "only if" direction was observed
in Observation 3.7. For the "if" direction, let $A$ be a weakly separated allocation rule. We prove by
contradiction that it is exploration-separated. If not, then there is a realization $\rho$ and a round $t$ such that $t$ is
influential w.r.t. $\rho$ but not bid-independent w.r.t. $\rho$. Let round $t$ be influential with bid vector $b$,
influencing agent $l$ and influenced agents $j$ and $j' \ne j$ in influenced round $t'$ (see [1] in Figure 1; all boxed
numbers in this sketch will refer to this figure).

From the assumption, $t$ is not bid-independent w.r.t. $\rho$, which means that there exists a bid profile $b'$ such
that $i' \ne l$ is played in round $t$ with bids $b'$. Using scalefreeness, IIA, and pointwise monotonicity, we can
prove that there exists a sufficiently large bid $b_{i'}^+$ of agent $i'$ such that she gets an impression in round $t$ with
Figure 1: This figure explains all the steps in the proof of Lemma 3.9. The rows correspond to agents (whose identity
is shown on the right side), and columns correspond to time rounds. The asterisks show the impressions. The arrows
show how the impressions get transferred, and labels on the arrows show what causes the transfer. In labels, "in $\rho$,
$b_i \uparrow$" denotes that a particular transfer of impression is caused in realization $\rho$ when bid $b_i$ is increased.

bids $(b_{i'}^+, b_{-i'})$ (see [2]). Using the properties of the mechanism, it can further be proved that there is an agent
$i$ such that she gets the impression in round $t$ when either $i$ increases her bid, or $l$ decreases her bid (see [3]).
When $i$ increases her bid to $b_i^+$, she also gets an impression in round $t'$, since impressions cannot differ in
round $t'$ in the case when $l$ is not played in round $t$ and they must get transferred from $j$ and $j'$ to somebody
in round $t'$, and IIA implies that this somebody should be $i$.

Recall that two different players $j$ and $j'$ get the impression in round $t'$ under $\rho$ and $\rho'$ respectively (see
[4]). We prove that either agent $j'$ or agent $j$ must be equal to $l$ (this is done by looking at how the allocation
in round $t'$ changes when $l$ decreases her bid). Let us break the symmetry and assume $j' = l$ (see box [5]).
It is also easy to see that when $i$ increases her bid, the impression in round $t'$ gets transferred to her in $\rho$ (at
some minimum value $b_i^{+,\rho}$, see [6]), and the impression in round $t'$ gets transferred to her also in $\rho'$ (at some
possibly different minimum value $b_i^{+,\rho'}$, see [7]). Using the assumption that $A$ is weakly separated, we prove
that $b_i^{+,\rho} = b_i^{+,\rho'}$ (see [8]). This can be proved by observing that $b_i^+ \ge \max\{b_i^{+,\rho}, b_i^{+,\rho'}\}$, and then using
the weak separation of $A$. Since these two bids were at a "threshold value" (these were the minimum
values of bids to have transferred the impression in $\rho$ and $\rho'$ from $j$ and $l$ respectively), we are able to prove
that the ratio $b_j/b_l$ must be some fixed number dependent on $\rho$, $\rho'$, and $t'$. In particular, it follows that $b_l$
belongs to a finite set $S(b_{-l})$ which depends only on $b_{-l}$. However, by non-degeneracy of $A$ there must be
infinitely many such $b_l$'s, which leads to a contradiction. □

4 Lower bounds on regret 

In this section we use structural results from the previous section to derive lower bounds on regret. 

Theorem 4.1. Consider the stochastic MAB mechanism design problem with $k$ agents. Let $A$ be an
exploration-separated deterministic allocation rule. Then its regret is $R(T; v_{\max}) = \Omega(v_{\max}\, k^{1/3}\, T^{2/3})$.

Let $\vec{\mu}_0 = (\tfrac12, \ldots, \tfrac12) \in [0,1]^k$ be the vector of CTRs in which for each agent the CTR is $\tfrac12$. For each
agent $i$, let $\vec{\mu}_i = (\mu_{i1}, \ldots, \mu_{ik}) \in [0,1]^k$ be the vector of CTRs in which agent $i$ has CTR $\mu_{ii} = \tfrac12 + \epsilon$,
$\epsilon = k^{1/3}\, T^{-1/3}$, and every other agent $j \ne i$ has CTR $\mu_{ij} = \tfrac12$. As a notational convention, denote by $\mathbb{P}_i[\cdot]$
and $\mathbb{E}_i[\cdot]$ respectively the probability and expectation induced by the algorithm when clicks are given by $\vec{\mu}_i$.
Let $\mathcal{I}_i$ be the problem instance in which CTRs are given by $\vec{\mu}_i$ and all bids are $v_{\max}$. For each agent $i$, let $\mathcal{J}_i$
be the problem instance in which CTRs are given by $\vec{\mu}_0$, the bid of agent $i$ is $v_{\max}$, and the bids of all other
agents are $v_{\max}/2$. We will show that for any exploration-separated deterministic allocation rule $A$, one of
these $2k$ instances causes high regret.

Let $N_i$ be the number of bid-independent rounds in which agent $i$ is played. Note that $N_i$ does not
depend on the bids. It is a random variable in the probability space induced by the clicks; its distribution
is completely specified by the CTRs. We show that (in a certain sense) the allocation cannot distinguish
between $\vec{\mu}_0$ and $\vec{\mu}_i$ if $N_i$ is too small. Specifically, let $A_t$ be the allocation in round $t$. Once the bids
are fixed, this is a random variable in the probability space induced by the clicks. For a given set $S$ of
agents, we consider the event $\{A_t \in S\}$ for some fixed round $t$, and upper-bound the difference between the
probability of this event under $\vec{\mu}_0$ and $\vec{\mu}_i$ in terms of $\mathbb{E}_i[N_i]$, in the following crucial claim, which is proved
in Appendix B via relative entropy techniques.

Claim 4.2. For any fixed vector of bids, each round $t$, each agent $i$ and each set of agents $S$, we have

$$\left| \mathbb{P}_0[A_t \in S] - \mathbb{P}_i[A_t \in S] \right| \le O(\epsilon^2\, \mathbb{E}_0[N_i]). \qquad (4.1)$$

Proof of Theorem 4.1. Fix a positive constant $\beta$ to be specified later. Consider the case $k = 2$ first. If
$\mathbb{E}_0[N_i] \ge \beta T^{2/3}$ for some agent $i$, then on the problem instance $\mathcal{J}_i$, regret is $\Omega(T^{2/3})$. So without loss of
generality let us assume $\mathbb{E}_0[N_i] < \beta T^{2/3}$ for each agent $i$. Then, plugging in the values for $\epsilon$ and $\mathbb{E}_0[N_i]$,
the right-hand side of (4.1) is at most $O(\beta)$. Take $\beta$ so that the right-hand side of (4.1) is at most $\tfrac14$. For
each round $t$ there is an agent $i$ such that $\mathbb{P}_0[A_t \ne i] \ge \tfrac12$. Then $\mathbb{P}_i[A_t \ne i] \ge \tfrac14$ by Claim 4.2, and therefore
in this round algorithm $A$ incurs regret $\Omega(\epsilon\, v_{\max})$ under problem instance $\mathcal{I}_i$. By the Pigeonhole Principle there
exists an $i$ such that this happens for at least half of the rounds $t$, which gives the desired lower bound.

Case $k \ge 3$ requires a different (and somewhat more complicated) argument. Let $R = \beta\, k^{1/3}\, T^{2/3}$ and
$N$ be the number of bid-independent rounds. Assume $\mathbb{E}_0[N] > R$. Then $\mathbb{E}_0[N_i] \le \tfrac12 \mathbb{E}_0[N]$ for some agent
$i$. For the problem instance $\mathcal{J}_i$ there are, in expectation, $\mathbb{E}[N - N_i] = \Omega(R)$ bid-independent rounds in
which agent $i$ is not played; each of which contributes $\Omega(v_{\max})$ to regret, so the total regret is $\Omega(v_{\max} R)$.

From now on assume that $\mathbb{E}_0[N] \le R$. Note that by the Pigeonhole Principle, there are more than $\tfrac{k}{2}$ agents
$i$ such that $\mathbb{E}_0[N_i] \le 2R/k$. Furthermore, let us say that an agent $i$ is good if $\mathbb{P}_0[A_t = i] \le \tfrac45$ for more than
$T/6$ different rounds $t$. We claim that there are more than $\tfrac{k}{2}$ good agents. Suppose not. If agent $i$ is not good
then $\mathbb{P}_0[A_t = i] > \tfrac45$ for at least $\tfrac56 T$ different rounds $t$, so if there are at least $k/2$ such agents then

$$T = \sum_{i} \sum_{t} \mathbb{P}_0[A_t = i] > \frac{k}{2} \times \frac{5T}{6} \times \frac{4}{5} = \frac{kT}{3} \ge T,$$

contradiction. Claim proved. It follows that there exists a good agent $i$ such that $\mathbb{E}_0[N_i] \le 2R/k$. Therefore
the right-hand side of (4.1) is at most $O(\beta)$. Pick $\beta$ so that the right-hand side of (4.1) is at most $\tfrac{1}{10}$. Then
by Claim 4.2, for at least $T/6$ different rounds $t$ we have $\mathbb{P}_i[A_t = i] \le \tfrac{9}{10}$. In each such round, if agent $i$ is
not played then algorithm $A$ incurs regret $\Omega(\epsilon\, v_{\max})$ on problem instance $\mathcal{I}_i$. Therefore, the (total) regret of
$A$ on problem instance $\mathcal{I}_i$ is $\Omega(\epsilon\, v_{\max} T) = \Omega(v_{\max}\, k^{1/3}\, T^{2/3})$. □
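The way the parameters in this proof balance can be checked numerically. The sketch below plugs in the perturbation size (k^(1/3) T^(-1/3)) and the threshold R = beta k^(1/3) T^(2/3) from the proof; the function name and the sample values of k, T and beta are illustrative assumptions.

```python
import math

def lower_bound_quantities(k, T, beta, v_max=1.0):
    """Quantities used in the proof of the regret lower bound (illustrative helper)."""
    eps = k ** (1 / 3) * T ** (-1 / 3)      # CTR perturbation
    R = beta * k ** (1 / 3) * T ** (2 / 3)  # threshold on bid-independent rounds
    # Right-hand side of (4.1), up to the hidden constant, when E_0[N_i] = 2R/k.
    rhs_scale = eps ** 2 * (2 * R / k)
    # Order of the total regret when Omega(T) rounds each lose eps * v_max.
    regret_scale = eps * v_max * T
    return eps, R, rhs_scale, regret_scale

eps, R, rhs_scale, regret_scale = lower_bound_quantities(k=5, T=10**6, beta=0.01)
print(eps, R, rhs_scale, regret_scale)
```

With these choices the quantity bounding the right-hand side of (4.1) equals 2 beta regardless of k and T, which is why a sufficiently small constant beta suffices, and the regret scale equals v_max k^(1/3) T^(2/3).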

Theorem 4.3. In the setting of Theorem 4.1, fix $k$ and $v_{\max}$ and assume that $R(T; v_{\max}) = O(v_{\max} T^{\gamma})$
for some $\gamma < 1$. Then for every fixed $\delta \le \tfrac14$ and $\lambda < 2(1 - \gamma)$ we have $R_\delta(T; v_{\max}) = \Omega(\delta\, v_{\max} T^{\lambda})$.

Proof. Fix $\lambda \in (0, 2(1 - \gamma))$. Redefine the $\vec{\mu}_i$'s with respect to a different $\epsilon$, namely $\epsilon = T^{-\lambda/2}$. Define the
problem instances $\mathcal{I}_i$ in the same way as before: all bids are $v_{\max}$, the CTRs are given by $\vec{\mu}_i$.

Let us focus on agents 1 and 2. We claim that $\mathbb{E}_1[N_1] + \mathbb{E}_2[N_2] \ge \beta T^{\lambda}$, where $\beta > 0$ is a constant to
be defined later. Suppose not. Fix all bids to be $v_{\max}$. For each round $t$, consider event $S_t = \{A_t = 1\}$.
Then by Claim 4.2 we have

$$|\mathbb{P}_1[S_t] - \mathbb{P}_2[S_t]| \le |\mathbb{P}_0[S_t] - \mathbb{P}_1[S_t]| + |\mathbb{P}_0[S_t] - \mathbb{P}_2[S_t]| \le O(\epsilon^2)\,(\mathbb{E}_1[N_1] + \mathbb{E}_2[N_2]) \le \tfrac14$$

for a sufficiently small $\beta$. Now, $\mathbb{P}_1[S_t] \ge \tfrac12$ for at least $T/2$ rounds $t$. This is because otherwise on problem
instance $\mathcal{I}_1$ regret would be $R(T) \ge \Omega(\epsilon T v_{\max}) = \Omega(v_{\max} T^{1-\lambda/2})$, which contradicts the assumption
$R(T) = O(v_{\max} T^{\gamma})$. Therefore $\mathbb{P}_2[S_t] \ge \tfrac14$ for at least $T/2$ rounds $t$, hence on problem instance $\mathcal{I}_2$ regret
is at least $\Omega(\epsilon T v_{\max})$, contradiction. Claim proved.

Now without loss of generality let us assume that $\mathbb{E}_1[N_1] \ge \tfrac{\beta}{2} T^{\lambda}$. Consider the problem instance in
which CTRs are given by $\vec{\mu}_1$, the bid of agent 2 is $v_{\max}$, and all other bids are $v_{\max}(1 - 2\delta)/(1 + 2\epsilon)$. It is easy to
see that this problem instance has $\delta$-gap. Each time agent 1 is selected, the algorithm incurs regret $\Omega(\delta\, v_{\max})$.
Thus the total regret is at least $\Omega(\delta\, N_1 v_{\max}) = \Omega(\delta\, v_{\max} T^{\lambda})$. □

Matching upper bound. Let us describe a very simple mechanism, called the naive MAB mechanism,
which matches the lower bound from Theorem 4.1 up to polylogarithmic factors (and also the lower bound
from Theorem 4.3 for $\gamma = \lambda = \tfrac23$ and constant $\delta$).

Fix the number of agents $k$, the time horizon $T$, and the bid vector $b$. The mechanism has two phases.
In the exploration phase, each agent is played for $T_0 := k^{-2/3}\, T^{2/3} (\log T)^{1/3}$ rounds, in a round-robin
fashion. Let $c_i$ be the number of clicks on agent $i$ in the exploration phase. In the exploitation phase, an
agent $i^* \in \arg\max_i c_i b_i$ is chosen and played in all remaining rounds. Payments are defined as follows:
agent $i^*$ pays $\max_{i \in [k] \setminus \{i^*\}} c_i b_i / c_{i^*}$ for every click she gets in the exploitation phase, and all others pay 0.
(Exploration rounds are free for every agent.) This completes the description of the mechanism.
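The two phases and the payment rule described above can be sketched as a short simulation. This is a minimal sketch under stated assumptions: clicks are drawn i.i.d. from hypothetical CTRs, the exploration length is rounded up, ties are broken towards the lowest index, the price is set to 0 if the winner received no exploration clicks, and exploration is simulated agent by agent rather than round-robin (which does not change the click counts in distribution). The function name and signature are illustrative.

```python
import math
import random

def naive_mab_mechanism(bids, ctrs, T, rng):
    """One simulated run; returns (winner, price per click, total payments)."""
    k = len(bids)
    # Exploration rounds per agent: T0 = k^(-2/3) T^(2/3) (log T)^(1/3), rounded up.
    T0 = math.ceil(k ** (-2 / 3) * T ** (2 / 3) * math.log(T) ** (1 / 3))
    assert k * T0 <= T, "horizon too short for the exploration phase"
    # Exploration phase: count the clicks c_i of each agent over T0 impressions.
    clicks = [sum(rng.random() < ctrs[i] for _ in range(T0)) for i in range(k)]
    scores = [clicks[i] * bids[i] for i in range(k)]
    # Exploitation phase: the agent maximizing c_i * b_i gets all remaining rounds.
    winner = max(range(k), key=lambda i: scores[i])
    runner_up = max(scores[i] for i in range(k) if i != winner)
    price = runner_up / clicks[winner] if clicks[winner] > 0 else 0.0
    exploit_clicks = sum(rng.random() < ctrs[winner] for _ in range(T - k * T0))
    payments = [0.0] * k              # exploration is free, losers pay nothing
    payments[winner] = price * exploit_clicks
    return winner, price, payments

rng = random.Random(0)
winner, price, payments = naive_mab_mechanism(
    [1.0, 0.8, 0.5], [0.3, 0.5, 0.4], T=10_000, rng=rng)
print(winner, round(price, 3), [round(p, 2) for p in payments])
```

Note that the per-click price never exceeds the winner's own bid, since the winner's score is at least the runner-up score.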

Observation 4.4. Consider the stochastic MAB mechanism design problem with $k$ agents. The naive
mechanism is normalized, truthful and has worst-case regret $R(T; v_{\max}) = O(v_{\max}\, k^{1/3}\, T^{2/3} \log^{2/3} T)$.

Proof. The mechanism is truthful by a simple second-price argument. Recall that $c_i$ is the number of
clicks $i$ got in the exploration phase. Let $p_i = \max_{j \ne i} c_j b_j / c_i$ be the price paid (per click) by agent $i$ if she
wins (all) rounds in the exploitation phase. If $v_i \ge p_i$, then by bidding anything greater than $p_i$ agent $i$ gains
$v_i - p_i$ utility for each click irrespective of her bid, and by bidding less than $p_i$ she gains 0, so bidding $v_i$ is weakly
dominant. Similarly, if $v_i < p_i$, then by bidding anything less than $p_i$ she gains 0, while by bidding $b_i > p_i$
she loses $p_i - v_i$ for each click. So bidding $v_i$ is weakly dominant in this case too.
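The second-price argument can be checked by brute force for fixed exploration click counts. The sketch below is an illustration only: the helper names, the click counts, the grid of values and bids, and the tie-breaking rule (a tie goes to another agent) are assumptions. Clicks received during exploration are ignored because they do not depend on the bid.

```python
def utility_per_click(value, bid, i, other_bids, clicks):
    """Utility of agent i per exploitation click; other_bids maps agent -> bid."""
    if clicks[i] == 0:
        return 0.0                       # agent i cannot win with zero score
    # Critical price p_i = max_{j != i} c_j b_j / c_i.
    price = max(clicks[j] * b for j, b in other_bids.items()) / clicks[i]
    wins = bid > price                   # assumption: ties go to another agent
    return value - price if wins else 0.0

def truthful_is_dominant(i, other_bids, clicks, grid):
    """Check on a grid that bidding the true value is never worse than any other bid."""
    return all(
        utility_per_click(v, v, i, other_bids, clicks)
        >= utility_per_click(v, b, i, other_bids, clicks)
        for v in grid for b in grid
    )

clicks = {0: 40, 1: 25, 2: 10}           # hypothetical exploration click counts
other_bids = {1: 0.8, 2: 0.5}
grid = [x / 20 for x in range(41)]       # values and bids in [0, 2]
print(truthful_is_dominant(0, other_bids, clicks, grid))
```

The check passes for any inputs of this form because the price does not depend on the agent's own bid: the bid only decides whether she wins, and she wants to win exactly when her value exceeds the price.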

For the regret bound, let $(\mu_1, \ldots, \mu_k)$ be the vector of CTRs, and let $\hat{\mu}_i = c_i/T_0$ be the sample CTRs.
By Chernoff bounds, for each agent $i$ we have $\Pr[|\hat{\mu}_i - \mu_i| > r] \le T^{-4}$, for $r = \sqrt{8 \log(T)/T_0}$. If in
a given run of the mechanism all estimates $\hat{\mu}_i$ lie in the intervals specified above, call the run clean. The
expected regret from the runs that are not clean is at most $O(v_{\max})$, and can thus be ignored. From now on
let us assume that the run is clean.

The regret in the exploration phase is at most $k\, T_0\, v_{\max} = O(v_{\max}\, k^{1/3}\, T^{2/3} \log^{1/3} T)$. For the
exploitation phase, let $j = \arg\max_i \mu_i b_i$. Then (since we assume that the run is clean) we have

$$(\mu_{i^*} + r)\, b_{i^*} \ge \hat{\mu}_{i^*} b_{i^*} \ge \hat{\mu}_j b_j \ge (\mu_j - r)\, b_j,$$

which implies $\mu_j v_j - \mu_{i^*} v_{i^*} \le r(v_j + v_{i^*}) \le 2r\, v_{\max}$. Therefore, the regret in the exploitation phase is at most
$2r\, v_{\max} T = O(v_{\max}\, k^{1/3}\, T^{2/3} \log^{2/3} T)$. Therefore the total regret is as claimed. □
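The two terms of the regret bound can be evaluated directly from the definitions of the exploration length and of the confidence radius. The sketch below is illustrative; the function name and the sample values of k and T are assumptions.

```python
import math

def naive_regret_terms(k, T, v_max=1.0):
    """Exploration and exploitation regret bounds for the naive mechanism (clean run)."""
    T0 = k ** (-2 / 3) * T ** (2 / 3) * math.log(T) ** (1 / 3)  # rounds per agent
    r = math.sqrt(8 * math.log(T) / T0)                          # confidence radius
    exploration = k * T0 * v_max          # at most v_max lost per exploration round
    exploitation = 2 * r * v_max * T      # at most 2 r v_max lost per later round
    return T0, r, exploration, exploitation

T0, r, exploration, exploitation = naive_regret_terms(k=5, T=10**6)
print(round(T0), round(r, 4), round(exploration), round(exploitation))
```

Substituting the exploration length into the radius gives an exploitation term of 2 sqrt(8) v_max k^(1/3) T^(2/3) (log T)^(1/3), so under this choice both terms are within the stated bound of order v_max k^(1/3) T^(2/3) log^(2/3) T.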

8 Independently, Devanur and Kakade [17] presented a version of the naive MAB mechanism that achieves the same regret
even in the more general model in which the value-per-click of an agent changes over time and the agents are allowed to submit a
different bid at every round. Instead of assigning all impressions to the same agent in the exploitation phase, their mechanism runs
the same allocation and payment procedure for each exploration round separately (see [17] for details).

9 Alternatively, one can use Theorem 3.8 since all exploration rounds are bid-independent, and only exploration rounds are
influential, and the payments are exactly as defined in Theorem 3.1.
5 Extensions and open questions



We extend our results in several directions. First, we derive a regret lower bound for deterministic truthful 
mechanisms without assuming that the allocations are scale-free. In particular, for two agents there are no 
assumptions. This lower bound holds for any k (the number of agents) assuming IIA, but unlike the one in 
Theorem 4.1, it does not depend on k. See Appendix C for details.

Second, we extend our results to randomized mechanisms. We consider randomized mechanisms that 
are universally truthful, i.e. truthful for each realization of the internal random seed. For mechanisms that 
randomize over exploration-separated deterministic allocation rules, we obtain the same lower bounds as in 
Theorem 4.1 and Theorem 4.3; see Appendix D for the details.

Third, we consider randomized allocation rules under a weaker version of truthfulness: a mechanism is 
weakly truthful if for each realization, it is truthful in expectation over its random seed. We show that any 
randomized allocation that is "pointwise monotone" and satisfies a certain notion of "separation between 
exploration and exploitation" can be turned into a mechanism that is weakly truthful and normalized. Then 
we apply this result to an algorithm in the literature H]|25l in order to obtain regret guarantees for the version 
of the MAB mechanism design problem in which the clicks are chosen by an oblivious adversary. (This
version corresponds to the adversarial MAB problem (7J[l6l|Tl[T0l.) The upper bound matches our lower 
bound for deterministic allocations up to $(\log k)^{1/3}$ factors. See Appendix E for details.

Fourth, we consider the stochastic MAB mechanism design problem under a more relaxed notion of 
truthfulness: truthfulness in expectation, where for each vector of CTRs the expectation is taken over clicks 
(and the internal randomness in the mechanism, if the latter is not deterministic). Following our line of 
investigation, we ask whether restricting a mechanism to be truthful in expectation has any implications 
on the structure and regret thereof. Given our results on mechanisms that are truthful and normalized, 
it is tempting to seek similar results for mechanisms that are truthful in expectation and normalized in 
expectation. We rule out this approach: we show that in order to obtain any non-trivial lower bounds on
regret and (essentially) any non-trivial structural results, one needs to assume that a mechanism is ex-post 
normalized, at least in some approximate sense. The key idea is to view the allocation and the payment as 
multivariate polynomials over the CTRs. See Appendix F for the details.

The two major questions left open by this work concern structural results and regret lower bounds
for (i) weakly truthful randomized mechanisms, and (ii) mechanisms that are truthful in
expectation. The latter question seems to require very different techniques which would further explore the
connection to polynomials that we used in Appendix F.
Appendix A: Truthfulness characterization

In this section we provide the proofs which did not fit into Section 3. We start with a complete proof of
the "only if" direction of Theorem 3.8.

Lemma A.1. Consider the MAB mechanism design problem. Let (A, V) be a normalized truthful
mechanism such that A is a non-degenerate deterministic allocation rule. Then A is weakly separated.

Proof. Assume $A$ is not weakly separated. Then there is a counterexample $(\rho, b, t, t', i)$: a realization $\rho$, bid
vector $b$, rounds $t, t'$ and agent $i$ such that round $t$ is $(b; \rho)$-influential with influenced agent $i$ and influenced
round $t'$, and it does not hold that round $t$ is $(b; \rho)$-secured from $i$. We prove that this leads to a contradiction.

Let us pick a counterexample $(\rho, b, t, t', i)$ with a minimum value of $t'$ over all choices of $(\rho, b, t, i)$.
Without loss of generality, let us assume that $\rho_i(t') = 1$ and $\rho_j(t'') = 0$ for all $t'' > t'$ and for all agents $j$.

Let $j = A(b; \rho; t)$. As it does not hold that round $t$ is $(b; \rho)$-secured from $i$, this means that $j \ne i$, and
there exists a bid $b_i^+ > b_i$ such that $A(b_i^+, b_{-i}; \rho; t) \ne j$.

Let $\rho' = \rho \oplus \mathbf{1}(j, t)$. The mechanism needs to compute prices for agent $i$ when her bid is $b_i^+$ under
realizations $\rho$ and $\rho'$, that is to compute $V_i(b_i^+, b_{-i}; \rho)$ and $V_i(b_i^+, b_{-i}; \rho')$. Therefore, the mechanism needs
to compute the integral $I_i(\varphi) = \int_{x \le b_i^+} C_i(x, b_{-i}; \varphi)\, dx$ for both realizations $\varphi \in \{\rho, \rho'\}$.

First of all, for all $x \le b_i^+$ and for all $t'' < t'$, $A_i(x, b_{-i}; \rho; t'') = A_i(x, b_{-i}; \rho'; t'')$. If not, then the
minimality of $t'$ would be violated. This is because, if there were such an $x$ and $t'' < t'$ with $A_i(x, b_{-i}; \rho; t'') \ne
A_i(x, b_{-i}; \rho'; t'')$, then round $t$ would still be $(b, \rho)$-influential with influenced agent $i$ and influenced round
$t'' < t'$, violating the minimality of $t'$. Therefore, when we decrease the bid of agent $i$, the only difference
in the allocation can occur at time round $t'$.

As $i$ is the influenced agent at round $t'$, it must hold that $A_i(b_i, b_{-i}; \rho; t') \ne A_i(b_i, b_{-i}; \rho'; t')$. Let us
assume $0 = A_i(b_i, b_{-i}; \rho; t') < A_i(b_i, b_{-i}; \rho'; t') = 1$ (otherwise, we can swap $\rho$ and $\rho'$). Note that we have
made use of the fact that the mechanism is deterministic. Let us make the claim that for all bids $x \le b_i^+$
the influence of round $t$ on round $t'$ is in the same "direction":

$$A_i(x, b_{-i}; \rho; t') \le A_i(x, b_{-i}; \rho'; t') \quad \text{for all } x \le b_i^+. \qquad (A.1)$$

Suppose (A.1) does not hold. Then there is an $x \le b_i^+$ such that $1 = A_i(x, b_{-i}; \rho; t') > A_i(x, b_{-i}; \rho'; t') = 0$.
(Note that we have used the fact that the mechanism is deterministic.) If $x > b_i$, then pointwise
monotonicity is violated in $\rho'$, since $0 = A_i(x, b_{-i}; \rho'; t') < A_i(b_i, b_{-i}; \rho'; t') = 1$. If $x < b_i$ on the other hand,
then pointwise monotonicity is violated in $\rho$, since $1 = A_i(x, b_{-i}; \rho; t') > A_i(b_i, b_{-i}; \rho; t') = 0$, giving
a contradiction in both cases. The claim (A.1) follows.

By the non-degeneracy of $A$, there exists a non-degenerate interval $I$ containing $b_i$ such that

$$A_i(x, b_{-i}; \rho; t') < A_i(x, b_{-i}; \rho'; t') \quad \text{for all } x \in I. \qquad (A.2)$$

By (A.1) and (A.2) it follows that $I_i(\rho) < I_i(\rho')$. However, the mechanism cannot distinguish between $\rho$
and $\rho'$ when agent $i$'s bid is $b_i^+$, since the differing bit $\rho_j(t)$ is not seen. Contradiction. □

A.1 Proof of Lemma 3.9

For convenience, let us restate the lemma. 

Lemma (Lemma 3.9 restated). Consider the MAB mechanism design problem. Let A be a non-degenerate
deterministic allocation rule which is scalefree, pointwise monotone, and satisfies IIA. Then it is
exploration-separated if and only if it is weakly separated.

The "only if" direction is a consequence of Observation 3.7. Here we prove the "if" direction. For bid
profile $b$, realization $\rho$, agent $l$ and round $t$, the tuple $(b; \rho; l; t)$ is called an influence-tuple if round $t$ is
$(b, \rho)$-influential with influencing agent $l$. Suppose allocation $A$ is weakly separated but not exploration-separated.
Then there is a counterexample: an influence-tuple $(b; \rho; l; t)$ such that round $t$ is not bid-independent
w.r.t. realization $\rho$. We prove that such a counterexample can occur only if $b_l \in S_l(b_{-l})$, for some finite
set $S_l(b_{-l}) \subset \mathbb{R}$ that depends only on $b_{-l}$.

Proposition A.2. Let $A$ be as in Lemma 3.9. Assume $A$ is weakly separated. Then for each agent $l$ and each
bid profile $b_{-l}$ there exists a finite set $S_l(b_{-l}) \subset \mathbb{R}$ with the following property: for each counterexample
$(b_l, b_{-l}; \rho; l; t)$ it is the case that $b_l \in S_l(b_{-l})$.

Once this proposition is proved, we obtain a contradiction with the non-degeneracy of $A$. Indeed,
suppose $(b; \rho; l; t)$ is a counterexample. Then $(b; \rho; l; t)$ is an influence-tuple. Since $A$ is non-degenerate, there
exists a non-degenerate interval $I$ such that for each $x \in I$ it holds that $(x, b_{-l}; \rho; l; t)$ is an influence-tuple,
and therefore a counterexample. Thus the set $S_l(b_{-l})$ in Proposition A.2 cannot be finite, contradiction.

In the rest of this section we prove Proposition A.2. Fix a counterexample $(b; \rho; l; t)$; let $t' > t$ be
the influenced round. In particular, $A(b; \rho; t) = l$ (see [1] in Figure 1; all boxed numbers will
refer to this figure). Then by the assumption there exist bids $b'$ such that $A(b'; \rho; t) = i' \ne l$. We claim
that this implies that there exists a bid $b_{i'}^+ > b_{i'}$ such that $A(b_{i'}^+, b_{-i'}; \rho; t) = i'$ (see [2]). This is proven in
Lemma A.4 below, and in order to prove it we first present the following lemma, which essentially states
that if the mechanism makes a choice between $i$ and $j$ of whom to show, then this choice can only depend on the
ratio of their bids $\mathrm{bid}_i/\mathrm{bid}_j$, and not on the bids of other agents.

Lemma A.3. Let $A$ be an MAB (deterministic) allocation rule that is pointwise monotone, scalefree, and
satisfies IIA. Let there be two bid profiles $\alpha$ and $\beta$ such that $A(\alpha; \rho; t) \in \{i, j\}$, $A(\beta; \rho; t) \in \{i, j\}$ and
$\alpha_i/\alpha_j = \beta_i/\beta_j$. Then it must be the case that $A(\alpha; \rho; t) = A(\beta; \rho; t)$.

Proof. As $A$ is scalefree we may assume that $\alpha_i = \beta_i$ and $\alpha_j = \beta_j$, by scaling the bids in $\beta$ by a factor of $\alpha_i/\beta_i$ (or
equivalently a factor of $\alpha_j/\beta_j$), without changing the allocation.

Assume for the sake of a contradiction that $A(\beta; \rho; t) \ne A(\alpha; \rho; t)$. Let us number the agents as follows.
Agents $i$ and $j$ are numbered 1 and 2, respectively. The rest of the agents are arbitrarily numbered 3 to $k$.
Consider the following sequence of bid vectors: $\alpha(1) = \alpha(2) = \alpha$ and $\alpha(m) = (\beta_m, \alpha(m-1)_{-m})$ for
$m \in \{3, \ldots, k\}$. As $\alpha(1) = \alpha$ and $\alpha(k) = \beta$, $A(\alpha(1); \rho; t) = A(\alpha; \rho; t)$ and $A(\alpha(k); \rho; t) = A(\beta; \rho; t)$.
Since $A(\alpha(k); \rho; t) = A(\beta; \rho; t) \ne A(\alpha; \rho; t) = A(\alpha(1); \rho; t)$, there exists $m \in \{3, \ldots, k\}$ such that
$A(\alpha(m-1); \rho; t) = A(\alpha; \rho; t) \in \{i, j\}$ while $A(\alpha(m); \rho; t) \ne A(\alpha(m-1); \rho; t)$. As $m \ne i$ and $m \ne j$,
IIA implies that $A(\alpha(m); \rho; t) = m$, and given that, IIA also implies that $A(\alpha(k); \rho; t) \in \{m, m+1, \ldots, k\}$
(note that $i, j$ are not in this set). But as $A(\alpha(k); \rho; t) = A(\beta; \rho; t) \in \{i, j\}$, this yields a contradiction. □

Lemma A.4. Let $A$ be an MAB (deterministic) allocation rule that is pointwise monotone, scalefree, and
satisfies IIA. Let there be two bid profiles $\alpha$ and $\beta$ such that $A(\alpha; \rho; t) = i$ and $A(\beta; \rho; t) = j \ne i$. Then
there exists $\beta_i^+ \ge \beta_i$ such that $A(\beta_i^+, \beta_{-i}; \rho; t) = i$.

In other words, if it is possible for i to get the impression in round t at all, then it is possible for her to 
get the impression starting from any bid profile and raising her bid high enough. 

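Proof. We first note that $\alpha_i/\alpha_j \ge \beta_i/\beta_j$. If not, then $\alpha_i/\alpha_j < \beta_i/\beta_j$. Consider a raised bid of $i$ from $\alpha_i$ to
$\alpha_i^+ = \alpha_j \cdot \beta_i/\beta_j$. In the bid profile $(\alpha_i^+, \alpha_{-i})$, $i$ must get the impression (by pointwise monotonicity). This gives a
contradiction to Lemma A.3, since $A(\alpha_i^+, \alpha_{-i}; \rho; t) = i \in \{i, j\}$, $A(\beta; \rho; t) = j \in \{i, j\}$, and $\alpha_i^+/\alpha_j = \beta_i/\beta_j$,
but $A(\alpha_i^+, \alpha_{-i}; \rho; t) \ne A(\beta; \rho; t)$.

Now, consider $i$ increasing her bid in profile $\beta$ to $\beta_i^+ = \beta_j \cdot \alpha_i/\alpha_j$. Now, $A(\alpha; \rho; t) = i \in \{i, j\}$,
$A(\beta_i^+, \beta_{-i}; \rho; t) \in \{i, j\}$ (from IIA), and $\alpha_i/\alpha_j = \beta_i^+/\beta_j$. We can apply Lemma A.3 to deduce that $A(\alpha; \rho; t) =
A(\beta_i^+, \beta_{-i}; \rho; t)$, and both are equal to $i$ since the first allocation is equal to $i$. □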

From the lemma above, it follows that agent $i'$ can increase her bid (in bid profile $b$) and get the impression
in realization $\rho$, round $t$. To quantify by how much agent $i'$ needs to raise her bid to get the impression,
we introduce the notion of threshold $\Theta_{i,j}(\rho;t)$ in the next lemma.

Lemma A.5. Let $\mathcal{A}$ be an MAB (deterministic) allocation rule that is pointwise monotone, scalefree and
satisfies IIA. For realization $\rho$, round $t$, two agents $i$ and $j \ne i$, let bids $b_{-i-j}$ be such that there exist
$x_0$ and $y$ satisfying $\mathcal{A}(x_0, y, b_{-i-j};\rho;t) = j$, and there exists $x$ (possibly dependent on $y$) satisfying
$\mathcal{A}(x, y, b_{-i-j};\rho;t) = i$. Let us fix such a $y$ and define (see footnote 12)

$$\Theta^{b_{-i-j}}_{i,j}(\rho;t) = \frac{1}{y}\,\inf\{x \mid \mathcal{A}(x, y, b_{-i-j};\rho;t) = i\}.$$

12 Note that if there are no values of bids of $i$ ($x_0$ and $x$) and $j$ (equal to $y$) such that $j$ can get an impression with small enough
bid ($x_0$) of agent $i$ and $i$ can get an impression by raising her bid (to $x$), then we don't define $\Theta^{b_{-i-j}}_{i,j}(\rho;t)$ at all. We will be careful
not to use such undefined $\Theta$'s. It is not hard to see that if bids are nonzero, then $\Theta_{i,j}(\rho;t)$ is defined if and only if $\Theta_{j,i}(\rho;t)$ is.
Moreover $0 < \Theta_{i,j}(\rho;t) < \infty$, and $\Theta_{j,i}(\rho;t) = (\Theta_{i,j}(\rho;t))^{-1}$.

Then for any bids $b'_{-i-j}$, $\Theta^{b'_{-i-j}}_{i,j}(\rho,t)$ is well defined and satisfies $\Theta^{b'_{-i-j}}_{i,j}(\rho,t) = \Theta^{b_{-i-j}}_{i,j}(\rho,t)$. We denote it
by $\Theta_{i,j}(\rho,t)$, as $\Theta^{b_{-i-j}}_{i,j}(\rho,t)$ is independent of $b_{-i-j}$.

Proof. We first prove that if the conditions of the definition of $\Theta^{b_{-i-j}}_{i,j}(\rho;t)$ are satisfied for $b_{-i-j}$, then they are
also satisfied for any other $b'_{-i-j}$. Let us say they are satisfied for $b_{-i-j}$, that is, there exist $x_0$, $x$ and $y$
such that $\mathcal{A}(x_0, y, b_{-i-j};\rho;t) = j$ and $\mathcal{A}(x, y, b_{-i-j};\rho;t) = i$. We want to prove existence of $x'$ and $y'$ for
$b'_{-i-j}$. If $\mathcal{A}(x_0, y, b'_{-i-j};\rho;t) = j$ then existence of $y'$ is proved for $b'_{-i-j}$ too, since $y' = y$ works. If not,
then $\mathcal{A}(x_0, y, b'_{-i-j};\rho;t) = j' \ne j$ and $\mathcal{A}(x_0, y, b_{-i-j};\rho;t) = j$, and by Lemma A.4 there exists a $y' > y$
such that $\mathcal{A}(x_0, y', b'_{-i-j};\rho;t) = j$. Once the existence of $y'$ is proved, we now prove the existence of $x'$.

Let $x' = x \cdot y'/y \ge x$. We have $\mathcal{A}(x, y, b_{-i-j};\rho;t) = i \in \{i,j\}$ and $\mathcal{A}(x', y', b'_{-i-j};\rho;t) \in \{i,j\}$ by IIA
($i$ can only transfer impression to her by changing her bid) and $x'/y' = x/y$. From Lemma A.3 we get
$i = \mathcal{A}(x, y, b_{-i-j};\rho;t) = \mathcal{A}(x', y', b'_{-i-j};\rho;t)$. Hence the existence of $x'$ is proved too.

For the sake of contradiction, let us assume that $\theta := \Theta^{b_{-i-j}}_{i,j}(\rho;t) < \Theta^{b'_{-i-j}}_{i,j}(\rho;t) =: \theta'$. Let us scale
the bids in $(x', y', b'_{-i-j})$ by a factor such that the factor times $y'$ is equal to $y$. We can hence assume that
$y' = y$. Let us pick a bid $x'' \in (\theta y, \theta' y)$. We have $\mathcal{A}(x'', y, b_{-i-j};\rho;t) = i$ (since $x''/y$ is past the threshold
$\theta$), $\mathcal{A}(x'', y' = y, b'_{-i-j};\rho;t) = j$ ($x''/y'$ is not yet past the threshold $\theta'$), and $x''/y = x''/y'$. This is a
contradiction to Lemma A.3. Therefore, $\theta = \theta'$. □
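To make the threshold concrete, here is a minimal sketch that estimates it by bisection for one simple allocation rule. Everything specific is an assumption for illustration: the rule `allocate` (impression goes to the largest bid times a fixed weight, which is scalefree, pointwise monotone and satisfies IIA), the weights, and the helper name `threshold` are hypothetical and not taken from the paper.

```python
import math
from typing import Dict, Sequence


def allocate(bids: Sequence[float], weights: Sequence[float]) -> int:
    """Hypothetical rule: the agent with the largest bid * weight gets the
    impression; ties go to the lowest index."""
    scores = [b * w for b, w in zip(bids, weights)]
    return scores.index(max(scores))


def threshold(i: int, j: int, y: float, others: Dict[int, float],
              weights: Sequence[float], hi: float = 1e6, iters: int = 100) -> float:
    """Approximate (1/y) * inf{x : agent i wins when i bids x and j bids y}.

    `others` maps every remaining agent to her (fixed) bid. Bisection is valid
    because pointwise monotonicity makes "i wins" monotone in x.
    """
    def winner(x: float) -> int:
        bids = [0.0] * len(weights)
        bids[i], bids[j] = x, y
        for agent, bid in others.items():
            bids[agent] = bid
        return allocate(bids, weights)

    if winner(0.0) != j or winner(hi) != i:
        raise ValueError("threshold is undefined for these bids")
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if winner(mid) == i:
            hi = mid
        else:
            lo = mid
    return hi / y


w = [0.5, 0.25, 0.1]
theta_a = threshold(0, 1, 2.0, {2: 1.0}, w)   # baseline
theta_b = threshold(0, 1, 2.0, {2: 3.0}, w)   # different bid of the third agent
theta_c = threshold(0, 1, 5.0, {2: 1.0}, w)   # different y
theta_rev = threshold(1, 0, 2.0, {2: 1.0}, w)  # roles of i and j swapped
print(theta_a, theta_b, theta_c, theta_rev)
print(math.isclose(theta_a * theta_rev, 1.0, rel_tol=1e-9))
```

For this particular rule the threshold is the weight ratio $w_j/w_i$, so it does not change with the third agent's bid or with $y$, and swapping the two agents inverts it, in line with the lemma and footnote 12.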

We conclude that if $b_{i'}^+ > b_l \cdot \Theta_{i',l}(\rho,t)$ then $\mathcal{A}(b_{i'}^+, b_{-i'};\rho;t) = i' \ne l$ (see [2] again). Note that we are
using $\Theta_{i',l}(\rho;t)$ since this is well-defined. Define $\rho' = \rho \oplus \mathbf{1}(l,t)$.

Let us think about decreasing the bid of agent $l$ from $b_l$ (it is positive, since all bids are assumed to be
positive). When the bid of agent $l$ is $b_l$, she gets the impression in round $t$, but when her bid is small enough
(in particular as low as $b_{i'}/\Theta_{i',l}(\rho;t)$), then she must not get the impression in round $t$ (see Lemma A.3).
When the bid of $l$ decreases, some other agent gets the impression in round $t$; let us call that agent $i$ (note
that this agent may not be the same as agent $i'$ above). See [3].

Now, starting from bid profile $b$, let us increase the bid of agent $i$. When the bid of agent $i$ is large
enough (in particular as large as $b_l\,\Theta_{i',l}(\rho;t)\,b_i/b_{i'}$), then $l$ can no longer get the impression in round $t$ (see
Lemma A.3). From IIA, the impression must get transferred to $i$. Therefore we can define $\Theta_{i,l}(\rho;t)$, and
when $b_i^+ > b_l\,\Theta_{i,l}(\rho;t)$, agent $i$ gets the impression in round $t$ (see [3] again). Note that $\mathcal{A}(b_i^+, b_{-i};\rho;t) =
\mathcal{A}(b_i^+, b_{-i};\rho';t) = i$ (click information for $l$ at round $t$ cannot influence the impression decision at round $t$).

Recall that $t'$ is the influenced round. Let $\mathcal{A}(b;\rho;t') = j$ and let $\mathcal{A}(b;\rho';t') = j' \ne j$ (see [4]). As $\mathcal{A}$ is
pointwise monotone and IIA, $\mathcal{A}(b_i^+, b_{-i};\rho;t') \in \{i,j\}$ and $\mathcal{A}(b_i^+, b_{-i};\rho';t') \in \{i,j'\}$. It must be the case
that $\mathcal{A}(b_i^+, b_{-i};\rho;t') = \mathcal{A}(b_i^+, b_{-i};\rho';t')$, as $l$ does not get an impression at round $t$ (and the algorithm does
not see the difference between $\rho$ and $\rho'$). As $j' \ne j$ we conclude that

$$\mathcal{A}(b_i^+, b_{-i};\rho;t') = \mathcal{A}(b_i^+, b_{-i};\rho';t') = i.$$

Next we note that $i \ne j$ and $i \ne j'$. This is because if $i = j$ (respectively $i = j'$), then round $t$
would be $(b;\rho)$-influential (respectively $(b;\rho')$-influential) with influenced agent $i$ but it is not $(b;\rho)$-secured
(respectively $(b;\rho')$-secured) from $i$, in contradiction to the assumption.

We also note that $l \in \{j, j'\}$ (see [5]). Assume for the sake of contradiction that $l \ne j$ and $l \ne j'$. For
$b_l^- < b_i \cdot \Theta_{l,i}(\rho,t)$ it holds that $\mathcal{A}(b_l^-, b_{-l};\rho;t) = \mathcal{A}(b_l^-, b_{-l};\rho';t) = i$ (since $i$ was defined such that $i$
gets the impression in round $t$ when $l$ decreases her bid), thus $\mathcal{A}(b_l^-, b_{-l};\rho;t') = \mathcal{A}(b_l^-, b_{-l};\rho';t')$ (as click
information for $l$ at round $t$ is not observed). (Also, as a side note, observe that $b_l^- < b_l$ by pointwise
monotonicity since agent $l$ was getting an impression in round $t$ with bid $b_l$ and lost it when her bid is $b_l^-$.)
Let $\mathcal{A}(b_l^-, b_{-l};\rho;t') = \mathcal{A}(b_l^-, b_{-l};\rho';t') = l'$. Note that $l' \ne l$, since otherwise $\mathcal{A}_l(x, b_{-l};\rho;t')$ is not a
monotone function of $x$: it is $0$ when $x = b_l$ (since $j$ gets an impression), and $1$ when $x = b_l^- < b_l$, a
contradiction to pointwise-monotonicity. Now, note that the impression in $\rho'$ at time $t'$ transfers from $j'$ to $l'$
and the impression in $\rho$ at time $t'$ transfers from $j$ to $l'$, none of which ($\{j, j', l'\}$) are equal to $l$, and $j \ne j'$.
Let us write this in equations:

$$\mathcal{A}(b_l, b_{-l};\rho;t') = j, \qquad \mathcal{A}(b_l^-, b_{-l};\rho;t') = l',$$

$$\mathcal{A}(b_l, b_{-l};\rho';t') = j', \qquad \mathcal{A}(b_l^-, b_{-l};\rho';t') = l'.$$

It must be the case that either $j \ne l'$ or $j' \ne l'$ (since $j \ne j'$). If $j \ne l'$, then in $\rho$ at time $t'$, reducing the
bid of $l$ transfers impression from $j$ to $l'$ (both of them are different from $l$), thus violating IIA. Similarly, if
$j' \ne l'$ then in $\rho'$ at time $t'$, reducing the bid of $l$ transfers impression from $j'$ to $l'$ (both of them are different
from $l$), thus violating IIA. We thus have $l \in \{j, j'\}$. Let $l = j'$ (since otherwise, we can swap the roles of $\rho$
and $\rho'$).

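To summarize what we have proved so far: there are 3 distinct agents $i, j, l$ such that

$\mathcal{A}(b;\rho;t) = \mathcal{A}(b;\rho';t) = \mathcal{A}(b;\rho';t') = l$ (since $\mathcal{A}(b;\rho';t') = j' = l$),
$\mathcal{A}(b;\rho;t') = j$ and

$\mathcal{A}(b_i^+, b_{-i};\rho;t) = \mathcal{A}(b_i^+, b_{-i};\rho;t') = \mathcal{A}(b_i^+, b_{-i};\rho';t) = \mathcal{A}(b_i^+, b_{-i};\rho';t') = i$.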

Observe also that $\Theta_{i,l}(\rho,t) = \Theta_{i,l}(\rho',t)$ as $\rho$ and $\rho'$ only differ at a click at round $t$, and such a click cannot
determine the allocation decision at round $t$. Also, $\max\{\Theta_{i,j}(\rho,t') \cdot b_j,\ \Theta_{i,l}(\rho',t') \cdot b_l\} < \Theta_{i,l}(\rho,t) \cdot b_l$, as
the allocation at round $t'$, which is different for $\rho$ and $\rho'$ (at $b$), depends on $l$ getting the impression at round
$t$ (see footnote 13). Finally we prove that $\Theta_{i,j}(\rho,t') \cdot b_j = \Theta_{i,l}(\rho',t') \cdot b_l$ (see the figure).

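Claim A.6. $\Theta_{i,j}(\rho,t') \cdot b_j = \Theta_{i,l}(\rho',t') \cdot b_l$.

Proof. First of all, note that $\Theta_{i,j}(\rho;t')$ and $\Theta_{i,l}(\rho';t')$ are well-defined. Let $\bar{b}_i = (\Theta_{i,j}(\rho,t') \cdot b_j + \Theta_{i,l}(\rho',t') \cdot
b_l)/2$. Consider the following two cases.

If $\Theta_{i,j}(\rho,t') \cdot b_j < \Theta_{i,l}(\rho',t') \cdot b_l$ then round $t$ is $(\bar{b}_i, b_{-i};\rho)$-influential (as $\mathcal{A}(\bar{b}_i, b_{-i};\rho;t') = i$
and $\mathcal{A}(\bar{b}_i, b_{-i};\rho';t') = l$) with influencing agent $l$ ($\mathcal{A}(\bar{b}_i, b_{-i};\rho;t) = \mathcal{A}(\bar{b}_i, b_{-i};\rho';t) = l$ since $\bar{b}_i <
\Theta_{i,l}(\rho,t) \cdot b_l$) and influenced agent $i$. Additionally, $t$ is not $(\bar{b}_i, b_{-i};\rho)$-secured from $i$ (as $\mathcal{A}(b_i^+, b_{-i};\rho;t) =
\mathcal{A}(b_i^+, b_{-i};\rho';t) = i$). This contradicts the first condition in the theorem.

Similarly, if $\Theta_{i,j}(\rho,t') \cdot b_j > \Theta_{i,l}(\rho',t') \cdot b_l$ then round $t$ is $(\bar{b}_i, b_{-i};\rho)$-influential (as now $\mathcal{A}(\bar{b}_i, b_{-i};\rho;t') =
j$ and $\mathcal{A}(\bar{b}_i, b_{-i};\rho';t') = i$) with influencing agent $l$ and influenced agent $i$. Additionally, $t$ is not
$(\bar{b}_i, b_{-i};\rho)$-secured from $i$. Again, this contradicts the first condition in the theorem. □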



The claim implies that $b_l \in S_l(b_{-l})$, where a finite set $S_l(b_{-l})$ is defined by

$$S_l(b_{-l}) = \left\{ b_j \cdot \frac{\Theta_{i,j}(\rho,t')}{\Theta_{i,l}(\rho',t')} \;:\; \text{all agents } i, j \ne l, \text{ all realizations } \rho, \rho' \text{ and all } t' \text{ s.t. } \frac{\Theta_{i,j}(\rho,t')}{\Theta_{i,l}(\rho',t')} \text{ is well-defined} \right\}.$$

This completes the proof of Proposition A.2.



Appendix B: Relative entropy technique: proof of Claim 4.2



We extend the relative entropy technique from [7]. All relevant facts about relative entropy are summarized
in the theorem below. We will need the following definition: given a random variable $X$ on a probability
space $(\Omega, \mathcal{F}, \mathbb{P})$, let $\mathbb{P}_X$ be the distribution of $X$, i.e. a measure on $\mathbb{R}$ defined by $\mathbb{P}_X(x) = \mathbb{P}[X = x]$.

Theorem B.1. Let $p$ and $q$ be two probability measures on a finite set $U$, and let $Y$ and $Z$ be functions on
$U$. There exists a function $F(p; q \mid Y) : U \to \mathbb{R}$ with the following properties:



13 In the figure we defined $b^+_{i,\rho} := \Theta_{i,j}(\rho;t')\,b_j$ and $b^+_{i,\rho'} := \Theta_{i,l}(\rho';t')\,b_l$. These are the bids of agent $i$ at which impression
transfers to her in round $t'$ in $\rho$ and $\rho'$ respectively. See [6] and [7] in the figure.

(i) $\mathbb{E}_p F(p; q \mid Y) = \mathbb{E}_p F(p; q \mid (Y, Z)) + \mathbb{E}_p F(p_Z; q_Z \mid Y)$ (chain rule),

(ii) $|p(U') - q(U')| \le \sqrt{\tfrac{1}{2}\, D(p\|q)}$ for any event $U' \subset U$, where $D(p\|q) = \mathbb{E}_p F(p; q \mid 1)$,

(iii) for each $x \in U$, if conditional on the event $\{Z = Z(x)\}$ $p$ coincides with $q$, then $F(p; q \mid Z)(x) = 0$,

(iv) for each $x \in U$, if conditional on the event $\{Z = Z(x)\}$ $p$ and $q$ are fair and $(\tfrac{1}{2} + \epsilon)$-biased coins,
respectively, then it is the case that $F(p; q \mid Z)(x) \le 4\epsilon^2$.

Remark. This theorem summarizes several well-known facts about relative entropy (albeit in a somewhat
non-standard notation). For the proofs, see [15, 24, 26]. In the proofs, one defines $F = F(p; q \mid Y)$ as a
function $F : U \to \mathbb{R}$ which is specified by $F(x) = \sum_{x' \in U_x} p(x' \mid U_x) \lg \frac{p(x' \mid U_x)}{q(x' \mid U_x)}$, where $U_x$ is the event
$\{Y = Y(x)\}$ (see footnote 14). Note that the quantity $\mathbb{E}_p F(p; q \mid 1)$ is precisely the relative entropy (a.k.a. KL-divergence),
commonly denoted $D(p\|q)$, and $\mathbb{E}_p F(p; q \mid Y)$ is the corresponding conditional relative entropy.
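As a quick numerical sanity check of properties (ii) and (iv) in the simplest case of two coins, the sketch below computes the relative entropy with base-2 logarithms. The function name `kl_bits`, the grid of $\epsilon$ values, and the constant $\tfrac{1}{2}$ inside the square root are assumptions made for this illustration.

```python
import math


def kl_bits(p: float, q: float) -> float:
    """Relative entropy (base-2 logs) between coins with heads-probabilities p and q.

    Uses the usual conventions: a term with p-mass 0 contributes 0, and the
    result is +infinity if p puts mass where q puts none.
    """
    total = 0.0
    for pp, qq in ((p, q), (1.0 - p, 1.0 - q)):
        if pp == 0.0:
            continue
        if qq == 0.0:
            return math.inf
        total += pp * math.log2(pp / qq)
    return total


eps_grid = [0.01 * s for s in range(1, 31)]  # eps in (0, 0.30]
checks = []
for eps in eps_grid:
    d = kl_bits(0.5, 0.5 + eps)               # fair coin vs. (1/2 + eps)-biased coin
    bound_iv = d <= 4 * eps ** 2              # property (iv)
    bound_ii = eps <= math.sqrt(0.5 * d)      # property (ii) for the event "heads"
    checks.append((eps, d, bound_iv, bound_ii))

print(all(c[2] and c[3] for c in checks))
```

Here the event "heads" has probabilities $\tfrac12$ and $\tfrac12 + \epsilon$ under the two coins, so the left-hand side of (ii) is exactly $\epsilon$.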

In what follows we use Theorem B.1 to prove Claim 4.2. For simplicity we will prove (4.1) for $i = 1$.

The history up to round $t$ is $H_t = (h_1, h_2, \ldots, h_t)$ where $h_s \in \{0, 1\}$ is the click or no click event
received by the algorithm at round $s$. Let $C_t$ be the indicator function of the event "round $t$ is bid-independent".
Define the bid-independent history as $\hat{H}_t = (\hat{h}_1, \hat{h}_2, \ldots, \hat{h}_t)$, where $\hat{h}_t = h_t C_t$. For any
exploration-separated deterministic allocation rule and each round $t$, the bid-independent history $\hat{H}_{t-1}$ and the bids
completely determine which arm is chosen in this round. Moreover, $\hat{H}_{t-1}$ alone (without the bids)
completely determines whether round $t$ is bid-independent, and if so, which arm is chosen in this round.

Recall the CTR vectors $\vec{\mu}_i$ as defined in Section 4. Let $p$ and $q$ be the distributions induced on $H_T$ by
$\vec{\mu}_0$ and $\vec{\mu}_1$, respectively. Let $p_t$ and $q_t$ be the distributions induced on $\hat{h}_t$ by $\vec{\mu}_0$ and $\vec{\mu}_1$, respectively. Let
$\mathcal{H}_t$ be the support of $H_t$, i.e. the set of all $t$-bit vectors. In the forthcoming applications of Theorem B.1 the
universe will be $U = \mathcal{H}_T$. By abuse of notation, we will treat $\hat{H}_t$ as a projection $\mathcal{H}_T \to \mathcal{H}_t$, so that it can
be considered a random variable under $p$ or $q$.

Claim B.2. $D(p\|q) = \mathbb{E}_p F(p; q \mid \hat{H}_t) + \sum_{s=1}^{t} \mathbb{E}_p F(p_s; q_s \mid \hat{H}_{s-1})$ for any $t \ge 1$.

Proof. Use induction on $t \ge 0$ (set $\hat{H}_0 = 1$). In order to obtain the claim for a given $t$ assuming that it holds
for $t - 1$, apply Theorem B.1(i) with $Y = \hat{H}_{t-1}$ and $Z = \hat{h}_t$. □

Claim B.3. $F(p_t; q_t \mid \hat{H}_{t-1}) \le 4\epsilon^2\, C_t\, \mathbf{1}_{\{A_t = 1\}}$ for each round $t$.

Proof. We are interested in the function $F = F(p_t; q_t \mid \hat{H}_{t-1}) : \mathcal{H}_T \to \mathbb{R}$. Given $\hat{H}_{t-1}$, one of the following
three cases occurs:

• round $t$ is not bid-independent. Then $\hat{h}_t = 0$, hence $F(\cdot) = 0$ by Theorem B.1(iii).

• round $t$ is bid-independent and arm 1 is not played. Then $\hat{h}_t$ is distributed as a fair coin under both $p$
and $q$, so again $F(\cdot) = 0$.

• round $t$ is bid-independent and arm 1 is played. Then $F(\cdot) \le 4\epsilon^2$ by Theorem B.1(iv). □

Given the full bid-independent history $\hat{H}_T$, $p$ and $q$ become (the same) point measure, so by
Theorem B.1(iii) $\mathbb{E}_p F(p; q \mid \hat{H}_T) = 0$. Therefore taking Claim B.2 with $t = T$ and applying Claim B.3 we obtain

$$D(p\|q) = \sum_{t=1}^{T} \mathbb{E}_p F(p_t; q_t \mid \hat{H}_{t-1}) \le 4\epsilon^2 \sum_{t=1}^{T} \mathbb{E}_p\big[C_t\, \mathbf{1}_{\{A_t = 1\}}\big] = 4\epsilon^2\, \mathbb{E}_p[N_1]. \tag{B.1}$$

For a given round $t$ and fixed bids, the allocation at round $t$ is completely determined by the bid-independent
history $\hat{H}_{t-1}$. Thus, we can treat $\{A_t \in S\}$ as an event in $\mathcal{H}_T$. Now (4.1) follows from (B.1) via an
application of Theorem B.1(ii) with $U' = \{A_t \in S\}$.

14 We use the convention that $p(x) \log(p(x)/q(x))$ is $0$ when $p(x) = 0$, and $+\infty$ when $p(x) > 0$ and $q(x) = 0$.

Appendix C: Lower bound for non-scalefree allocations



In this section we derive a regret lower bound for deterministic truthful mechanisms without assuming
that the allocations are scale-free. In particular, for two agents there are no assumptions. This lower bound
holds for any $k$ (the number of agents) assuming that the allocation satisfies IIA, but unlike the one in
Theorem 4.1 it does not depend on $k$.

Theorem C.1. Consider the stochastic MAB mechanism design problem with $k$ agents. Let $(\mathcal{A}, \mathcal{P})$ be a
normalized truthful mechanism such that $\mathcal{A}$ is a non-degenerate deterministic allocation rule. Suppose $\mathcal{A}$
satisfies IIA. Then its regret is $R(T; v_{\max}) = \Omega(v_{\max}\, T^{2/3})$ for any sufficiently large $v_{\max}$.
Let us sketch the proof. Fix an allocation $\mathcal{A}$. In Definition 3.4, if round $t$ is $(b, \rho)$-influential, for
some realization $\rho$ and bid vector $b$, an agent $i$ is called strongly influenced by round $t$ if it is one of the
two agents that are "influenced" by round $t$ but is not the "influencing agent" of round $t$. In particular,
it holds that $\mathcal{A}(b, \rho, t) \ne i$. For each realization $\rho$, round $t$ and agent $i$, if there exists a bid vector $b$
such that round $t$ is $(b, \rho)$-influential with strongly influenced agent $i$, then fix any one such $b$, and define
$b_i^* = b_i^*(\rho, t) := \max_{j \ne i} b_j$. Let us define $B^*_{\mathcal{A}} = \max_{\rho, t, i} b_i^*(\rho, t)$, where the maximum is taken over all
realizations $\rho$, all rounds $t$, and all agents $i$. Let us say that round $t$ is $B^*_{\mathcal{A}}$-free from agent $i$ w.r.t. realization
$\rho$ if for this realization the following property holds: agent $i$ is not selected in round $t$ as long as each bid is
at least $B^*_{\mathcal{A}}$.

Lemma C.2. In the setting of Theorem C.1, for any realization $\rho$, any influential round $t$ is $B^*_{\mathcal{A}}$-free from
some agent w.r.t. $\rho$.

Proof. Fix realization $\rho$. Since round $t$ is influential, for some bid profile $b$ and agent $i$ it is $(b, \rho)$-influential
with a strongly influenced agent $i$. By definition of $b_i^*(\rho, t)$, without loss of generality each bid in $b$ (other
than $i$'s bid) is at most $b_i^*(\rho, t) \le B^*_{\mathcal{A}}$. Then $\mathcal{A}(b, \rho, t) \ne i$, and round $t$ is $(b, \rho)$-secured from agent $i$.

Suppose round $t$ is not $B^*_{\mathcal{A}}$-free from agent $i$ w.r.t. $\rho$. Then there exists a bid profile $b'$ in which each bid
(other than $i$'s bid) is at least $B^*_{\mathcal{A}}$ such that $\mathcal{A}(b', \rho, t) = i$. To derive a contradiction, let us transform $b$ to
$b'$ by adjusting first the bid of agent $i$ and then bids of agents $j \ne i$ one agent at a time. Initially agent $i$ is
not chosen in round $t$, and after the last step of this transformation agent $i$ is chosen. Thus it is chosen at
some step, say when we adjust the bid of agent $i$ or some agent $j \ne i$. This transfer of impression to agent
$i$ cannot happen when the bid of agent $i$ is adjusted from $b_i$ to $b'_i$ (since round $t$ is $(b; \rho)$-secured from $i$), and
it cannot happen when the bid of player $j \ne i$ is adjusted from $b_j$ to $b'_j \ge b_j$ (this is because the transfer to
$i$ cannot happen from $j$ because of pointwise-monotonicity and the transfer to $i$ cannot happen from $l \ne j$
because of IIA). This is a contradiction. □

Let $T$ be the time horizon. Assume $v_{\max} \ge 2B^*_{\mathcal{A}}$. Let $N(\rho)$ be the number of influential rounds w.r.t.
realization $\rho$. Let $N_i(\rho)$ be the number of influential rounds w.r.t. realization $\rho$ that are $B^*_{\mathcal{A}}$-free from agent
$i$ w.r.t. $\rho$. Then $N$ and the $N_i$'s are random variables in the probability space induced by the clicks. By
Lemma C.2 we have that $\sum_i N_i(\rho)$ is at least the number of influential rounds. As in Section 4, let $\vec{\mu}_0$ be
the vector of CTRs in which all CTRs are $\tfrac{1}{2}$, and let $\mathbb{E}_0[\cdot]$ denote expectation w.r.t. $\vec{\mu}_0$.

Fix a constant $\beta > 0$ to be specified later. If $\mathbb{E}_0[N] \ge \beta k T^{2/3}$ then $\mathbb{E}_0[N_i] \ge \beta T^{2/3}$ for some agent
$i$, so the allocation incurs expected regret $R(T; v_{\max}) \ge \Omega(\beta\, v_{\max}\, T^{2/3})$ on any problem instance $\mathcal{I}_j$, $j \ne i$.
(In this problem instance, CTRs are given by $\vec{\mu}_0$, the bid of agent $j$ is $v_{\max}$, and all other bids are $v_{\max}/2$.) Now
suppose $\mathbb{E}_0[N] \le \beta k T^{2/3}$. Then the desired regret bound follows by an argument very similar to the one in
the last paragraph of the proof of Theorem 4.1.

Appendix D: Universally truthful randomized mechanisms

Consider randomized mechanisms that are universally truthful, i.e. truthful for each realization of the
internal random seed. For mechanisms that randomize over exploration-separated deterministic mechanisms,
we obtain the same lower bounds as in Theorem 4.1 and Theorem 4.3.

Theorem D.1. Consider the MAB mechanism design problem. Let $\mathcal{D}$ be a distribution over exploration-separated
deterministic allocation rules. Then

$$\mathbb{E}_{\mathcal{A} \in \mathcal{D}}\big[R_{\mathcal{A}}(T; v_{\max})\big] = \Omega(v_{\max}\, k^{1/3}\, T^{2/3}).$$

Proof Sketch. Recall that in the proof of Theorem 4.1 we define a family $\mathcal{F}$ of $2k$ problem instances, and
show that if $\mathcal{A}$ is an exploration-separated deterministic allocation rule, then on one of these instances its
regret is "high". In fact, we can extend this analysis to show that the regret is "high", that is at least
$R^* = \Omega(v_{\max}\, k^{1/3}\, T^{2/3})$, on an instance $\mathcal{I} \in \mathcal{F}$ chosen uniformly at random from $\mathcal{F}$; here regret is in
expectation over the choice of $\mathcal{I}$ (see footnote 15). Once this is proved, it follows that regret is at least $R^*/2$ for any distribution
over such $\mathcal{A}$, in expectation over both the choice of $\mathcal{A}$ and the choice of $\mathcal{I}$. Thus there exists a single
(deterministic) instance $\mathcal{I}$ such that $\mathbb{E}_{\mathcal{A} \in \mathcal{D}}[R_{\mathcal{A},\mathcal{I}}(T)] \ge R^*/2$. □
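The averaging step at the end of this sketch is easy to check numerically. The following sketch uses a made-up regret table (rows are deterministic rules, columns are instances); the function name `averaging_bound` and all numbers are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def averaging_bound(regret: Sequence[Sequence[float]],
                    weights: Sequence[float]) -> Tuple[float, float]:
    """regret[a][z] is the regret of deterministic rule a on instance z and
    `weights` is a distribution over rules.

    Returns (r_star, worst), where r_star is the smallest regret any single
    rule achieves on a uniformly random instance, and worst is the largest
    expected regret of the randomized rule over single instances.
    """
    n_inst = len(regret[0])
    r_star = min(sum(row) / n_inst for row in regret)
    mixed = [sum(w * row[z] for w, row in zip(weights, regret)) for z in range(n_inst)]
    return r_star, max(mixed)


rng = random.Random(1)
regret: List[List[float]] = [[rng.uniform(0, 10) for _ in range(6)] for _ in range(4)]
raw = [rng.random() for _ in range(4)]
weights = [x / sum(raw) for x in raw]
r_star, worst = averaging_bound(regret, weights)
print(r_star, worst, worst >= r_star - 1e-9)
```

The maximum over instances is at least the average over instances, and the average is a convex combination of per-rule averages, each of which is at least `r_star`; this is the only fact the last two sentences of the sketch rely on.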

Theorem 4.3 extends similarly.

Appendix E: Randomized allocations and adversarial clicks 

In this section we discuss randomized allocations and the version of the MAB mechanism design problem
when clicks are generated adversarially, termed the adversarial MAB problem. In this version, the
objective is to optimize the worst-case regret over all values $v = (v_1, \ldots, v_k)$ such that $v_i \in [0, v_{\max}]$ for
each $i$, and all realizations $\rho$:

$$R(T; v; \rho) = \max_i\, v_i \sum_{t=1}^{T} \rho_i(t) \;-\; \sum_{t=1}^{T} \sum_{i=1}^{k} v_i\, \rho_i(t)\, \mathbb{E}\big[\mathcal{A}_i(v; \rho; t)\big] \tag{E.1}$$

$$R(T; v_{\max}) = \max\{R(T; v; \rho) : \text{all realizations } \rho, \text{ all } v \text{ such that } v_i \in [0, v_{\max}] \text{ for each } i\}.$$



The first term in (E.1) is the social welfare from the best time-invariant allocation, the second term is the
social welfare generated by $\mathcal{A}$.
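For a single realization, the regret in (E.1) is straightforward to evaluate once the per-round allocation probabilities are known. A minimal sketch, with a hypothetical function name and toy data:

```python
from typing import Sequence


def adversarial_regret(values: Sequence[float],
                       clicks: Sequence[Sequence[int]],
                       alloc_prob: Sequence[Sequence[float]]) -> float:
    """Regret for one realization.

    values[i]        -- value per click v_i
    clicks[t][i]     -- rho_i(t) in {0, 1}: would agent i be clicked in round t
    alloc_prob[t][i] -- probability that the allocation picks agent i in round t
    """
    k = len(values)
    # Welfare of the best time-invariant allocation (always show the same agent).
    best_fixed = max(values[i] * sum(row[i] for row in clicks) for i in range(k))
    # Expected welfare of the allocation itself.
    welfare = sum(values[i] * row[i] * probs[i]
                  for row, probs in zip(clicks, alloc_prob)
                  for i in range(k))
    return best_fixed - welfare


values = [1.0, 0.5]
clicks = [[1, 0], [0, 1], [1, 1]]
uniform = [[0.5, 0.5] for _ in clicks]
print(adversarial_regret(values, clicks, uniform))
```

In this toy example always showing agent 0 earns welfare 2, the uniformly random allocation earns 1.5 in expectation, and the regret is therefore 0.5.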

Let us make a few definitions related to truthfulness. Recall that a mechanism is called weakly truthful if
for each realization, it is truthful in expectation over its random seed. A randomized allocation is pointwise
monotone if for each realization and each bid profile, increasing the bid of any one agent does not decrease
the probability of this agent being allocated in any given round. For a set $S$ of rounds and a function $\sigma : S \to$
{agents}, an allocation is $(S, \sigma)$-separated if (i) it coincides with $\sigma$ on $S$, (ii) the clicks from the rounds not
in $S$ are discarded (not reported to the algorithm). An allocation is strongly separated if before round 1,
without looking at the bids, it randomly chooses a set $S$ of rounds and a function $\sigma : S \to$ {agents}, and
then runs a pointwise monotone $(S, \sigma)$-separated allocation. Note that the choice of $S$ and $\sigma$ is independent
of the clicks, by definition.

We show that for any (randomized) strongly separated allocation rule $\mathcal{A}$ there exists a payment rule
which results in a mechanism that is weakly truthful and normalized. Then we consider PSim [8, 25], a
randomized MAB algorithm from the literature, and show that it is pointwise monotone and strongly separated.
When interpreted as an allocation rule, this algorithm has strong regret guarantees for the adversarial
MAB mechanism design problem, where the clicks are chosen by an oblivious adversary. Specifically, PSim
obtains regret $R(T; v_{\max}) = O(v_{\max}\, k^{1/3} (\log k)^{1/3}\, T^{2/3})$.

We start with the structural result. 



15 This extension requires only minor modifications to the proof of Theorem 4.1. For instance, for the case $k > 3$ we argue that
first, if E [N] > R then Eo[JVj] < jE [N] for at least | agents i (and so on), and if E [N] < R then (omitting some details) 
there are $\Omega(k)$ good agents $i$ such that $\mathbb{E}_0[N_i] \le 2R/k$ (and so on).

Lemma E.1. Consider the MAB mechanism design problem. Let $\mathcal{A}$ be a (randomized) strongly separated
allocation rule. Then there exists a payment rule $\mathcal{P}$ such that the resulting mechanism $(\mathcal{A}, \mathcal{P})$ is normalized
and weakly truthful.

Proof. Throughout the proof, let us fix a realization $\rho$, time horizon $T$, bid vector $b$, and agent $i$. We will
consider the payment of agent $i$. We will vary the bid of agent $i$ on the interval $[0, b_i]$; the bids $b_{-i}$ of all
other agents always stay the same.

Let $c_i(x)$ be the number of clicks received by agent $i$ given that her bid is $x$. Then by (the appropriate
version of) Theorem 3.1, the payment of agent $i$ must be $\mathcal{P}_i(b)$ such that

$$\mathbb{E}_{\mathcal{A}}[\mathcal{P}_i(b)] = \mathbb{E}_{\mathcal{A}}\Big[\, b_i\, c_i(b_i) - \int_0^{b_i} c_i(x)\, dx \Big] \tag{E.2}$$

where the expectation is taken over the internal randomness in the algorithm.

Recall that initially $\mathcal{A}$ randomly selects, without looking at the bids, a set $S$ of rounds and a function
$\sigma : S \to$ {agents}, and then runs some pointwise monotone $(S, \sigma)$-separated allocation $\mathcal{A}^{(S,\sigma)}$. In what
follows, let us fix $S$ and $\sigma$, and denote $\mathcal{A}^* = \mathcal{A}^{(S,\sigma)}$. We will refer to the rounds in $S$ as exploration
rounds, and to the rounds not in $S$ as exploitation rounds. Let $\gamma_i^*(x, t)$ be the probability that algorithm $\mathcal{A}^*$
allocates agent $i$ in round $t$ given that agent $i$ bids $x$. Note that for a fixed value of the internal random seed of
$\mathcal{A}^*$ this probability can only depend on the clicks observed in exploration rounds, which are known to the
mechanism. Therefore, abstracting away the computational issues, we can assume that it is known to the
mechanism. Define the payment rule as follows: in each exploitation round $t$ in which agent $i$ is chosen and
clicked, charge

$$\mathcal{P}_i^*(b, t) = b_i - \frac{1}{\gamma_i^*(b_i, t)} \int_0^{b_i} \gamma_i^*(x, t)\, dx. \tag{E.3}$$

Then the total payment assigned to agent $i$ is

$$\mathcal{P}_i(b) = \sum_{t \notin S} \rho_i(t)\, \mathcal{A}_i^*(b; \rho; t)\, \mathcal{P}_i^*(b, t). \tag{E.4}$$



Since allocation $\mathcal{A}^*$ is pointwise monotone, the probability $\gamma_i^*(x, t)$ is non-decreasing in $x$. Therefore
$\mathcal{P}_i^*(b, t) \in [0, b_i]$ for each round $t$. It follows that the mechanism is normalized (for any realization of the
random seed of allocation $\mathcal{A}$).
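The per-click charge is easy to evaluate numerically for a given allocation-probability curve. The sketch below is an illustration under assumptions: it takes the charge to be $b_i - \frac{1}{\gamma(b_i)}\int_0^{b_i}\gamma(x)\,dx$, and the function name `per_click_price`, the midpoint quadrature, and the example curve $\gamma(x) = x/(x+1)$ are all hypothetical.

```python
import math
from typing import Callable


def per_click_price(gamma: Callable[[float], float], b_i: float, steps: int = 10000) -> float:
    """Charge b_i - (1 / gamma(b_i)) * integral_0^{b_i} gamma(x) dx, where
    gamma(x) is the probability of being allocated in this round when bidding x.
    The integral is approximated with the midpoint rule."""
    g = gamma(b_i)
    if g <= 0.0:
        raise ValueError("agent is never allocated at this bid, so she is never charged")
    h = b_i / steps
    integral = h * sum(gamma((s + 0.5) * h) for s in range(steps))
    return b_i - integral / g


# A non-decreasing allocation probability (made up for the example).
price = per_click_price(lambda x: x / (x + 1.0), 2.0)
exact = 2.0 - 1.5 * (2.0 - math.log(3.0))   # closed form for this particular curve
print(price, exact)

# If the probability does not depend on the bid, the charge is zero.
flat = per_click_price(lambda x: 0.5, 3.0)
print(flat)
```

Because the curve is non-decreasing, the integral is at most $b_i\,\gamma(b_i)$, which is why the computed charge lands in $[0, b_i]$.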

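It remains to check that the payment rule (E.3) results in (E.2). Let $c_i^*(x)$ be the number of clicks
allocated to agent $i$ by allocation $\mathcal{A}^*$ given that her bid is $x$. Let $c_i^{\mathrm{expl}}(x)$ be the corresponding number of
clicks in exploitation rounds only. Since $\mathcal{A}^*$ is $(S, \sigma)$-separated, we have

$$\mathbb{E}\big[c_i^*(x) - c_i^{\mathrm{expl}}(x)\big] = \sum_{t \in S:\, \sigma(t) = i} \rho_i(t) = \mathrm{const}(x). \tag{E.5}$$

Taking expectations in (E.4) over the random seed of $\mathcal{A}^*$ and using (E.5), we obtain

$$\begin{aligned}
\mathbb{E}[\mathcal{P}_i(b)] &= \sum_{t \notin S} \rho_i(t)\, \gamma_i^*(b_i, t)\, \mathcal{P}_i^*(b, t) \\
&= \sum_{t \notin S} \rho_i(t) \Big[ b_i\, \gamma_i^*(b_i, t) - \int_0^{b_i} \gamma_i^*(x, t)\, dx \Big] \\
&= b_i \Big[ \sum_{t \notin S} \rho_i(t)\, \gamma_i^*(b_i, t) \Big] - \int_0^{b_i} \Big[ \sum_{t \notin S} \rho_i(t)\, \gamma_i^*(x, t) \Big] dx \\
&= b_i\, \mathbb{E}\big[c_i^{\mathrm{expl}}(b_i)\big] - \int_0^{b_i} \mathbb{E}\big[c_i^{\mathrm{expl}}(x)\big]\, dx \\
&= \mathbb{E}\Big[ b_i\, c_i^*(b_i) - \int_0^{b_i} c_i^*(x)\, dx \Big].
\end{aligned}$$

Finally, taking expectations over the choice of $S$ and $\sigma$, we obtain (E.2). □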

E.1 Algorithm PSim is strongly separated

In this subsection, we consider PSim [8, 25], an algorithm for the adversarial MAB problem. We interpret
this algorithm as an allocation rule, and observe that it is strongly separated.
As usual, $k$ denotes the number of agents; let $[k]$ denote the set of agents.

Input: Time horizon $T$, bid vector $b$. Let $v_{\max} = \max_j b_j$.
Output: For each round $t \le T$, a distribution on $[k]$.

1. Divide the time horizon into $P$ phases of $T/P$ consecutive rounds each.

2. From rounds of each phase $p$, pick without replacement $k$ rounds at random (called the exploration
rounds) and assign them randomly to $k$ arms. Let $S$ denote the set of all exploration rounds (of all
phases). Let $f : S \to [k]$ be the function which tells which arm is assigned to an exploration round in
$S$. The rounds in $[T] \setminus S$ are called the exploitation rounds.

3. Let $w_i(0) = 1$ for all $i \in [k]$.

4. For each phase $p = 1, 2, \ldots, P$

(a) For each round $t$ in phase $p$

i. If $t \in S$ and $f(t) = i$, then define the distribution $\gamma(b; t; S, f)$ such that $\gamma_i(b; t; S, f) = 1$.
Pick an agent according to this distribution (equivalently, pick agent $i$), observe the click
$\rho_i(t)$, and update $w_i(p)$ multiplicatively,

$$w_i(p) = w_i(p-1) \cdot (1 + \epsilon)^{b_i\, \rho_i(t) / v_{\max}}.$$

ii. If $t \notin S$, then define the distribution $\gamma(b; t; S, f)$ such that $\gamma_i(b; t; S, f) = \frac{w_i(p-1)}{\sum_j w_j(p-1)}$. Pick
an agent according to $\gamma(b; t; S, f)$, observe the feedback, and discard the feedback.
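The steps above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the reference algorithm: the multiplicative update is taken to be $w_i(p) = w_i(p-1)(1+\epsilon)^{b_i \rho_i(t)/v_{\max}}$, "assign them randomly to $k$ arms" is read as a uniformly random one-to-one assignment, rounds are 0-indexed, and the names `choose_exploration` and `psim_distributions` are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def choose_exploration(T: int, P: int, k: int, rng: random.Random) -> Dict[int, int]:
    """Steps 1-2: in each of P phases (T // P rounds each) pick k distinct rounds
    and assign them to the k arms in random order. Returns f: round -> arm.
    The bids are never consulted here."""
    phase_len = T // P
    if phase_len < k:
        raise ValueError("each phase needs at least k rounds")
    f: Dict[int, int] = {}
    for p in range(P):
        rounds = rng.sample(range(p * phase_len, (p + 1) * phase_len), k)
        arms = list(range(k))
        rng.shuffle(arms)
        f.update(zip(rounds, arms))
    return f


def psim_distributions(bids: Sequence[float], clicks: Sequence[Sequence[int]],
                       f: Dict[int, int], T: int, P: int, eps: float) -> List[List[float]]:
    """Steps 3-4: return the distribution over agents used in every round.
    Rounds beyond P * (T // P) are ignored in this sketch."""
    k = len(bids)
    v_max = max(bids)
    phase_len = T // P
    w = [1.0] * k                      # weights w_i(p - 1) used throughout phase p
    dists: List[List[float]] = []
    for p in range(P):
        w_next = list(w)
        for t in range(p * phase_len, (p + 1) * phase_len):
            if t in f:                 # exploration: play the assigned arm, use its click
                i = f[t]
                dists.append([1.0 if a == i else 0.0 for a in range(k)])
                w_next[i] = w[i] * (1.0 + eps) ** (bids[i] * clicks[t][i] / v_max)
            else:                      # exploitation: sample by weight, discard feedback
                total = sum(w)
                dists.append([x / total for x in w])
        w = w_next
    return dists


rng = random.Random(0)
T, P, k = 12, 3, 2
f = choose_exploration(T, P, k, rng)
clicks = [[1, 1] for _ in range(T)]
dists = psim_distributions([1.0, 2.0], clicks, f, T, P, eps=0.5)
print(sorted(f.items()))
print(dists[-1])
```

Only clicks from rounds in `f` are ever read, so changing the clicks in exploitation rounds leaves every distribution unchanged; this is the separation property discussed next.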

If we pick the values $\epsilon = (k \log k / T)^{1/3}$ and $P = (\log k)^{1/3} (T/k)^{2/3}$, then the regret of PSim is bounded
by $O((k \log k)^{1/3}\, T^{2/3}\, v_{\max})$ against any oblivious adversary (see [8, 25]).
We next prove that PSim is strongly separated.

It is clear from the structure of PSim above that it chooses a set $S$ of exploration rounds and a function
$f : S \to [k]$ in the beginning without looking at the bids and then runs an $(S, f)$-separated allocation.
We need to prove that the $(S, f)$-separated allocation is pointwise monotone. For this we need to prove that
the probability $\gamma_i(b; t; S, f)$ is monotone in the bid of agent $i$, where $\gamma_i(b; t; S, f)$ denotes the probability
of picking agent $i$ in round $t$ when bids are $b$ given the choice of $S$ and $f$. If $t \in S$, then $\gamma_i(b; t; S, f)$ is
independent of bids, and hence is monotone in $b_i$. Let $t \notin S$ and let $t$ be a round in phase $p$. Let us denote by
$t_i(p)$ the (unique) exploration round in phase $p$ assigned to agent $i$. We then have

$$\gamma_i(b; t; S, f) = \frac{(1+\epsilon)^{\frac{b_i}{v_{\max}} \sum_{p' < p} \rho_i(t_i(p'))}}{\sum_j (1+\epsilon)^{\frac{b_j}{v_{\max}} \sum_{p' < p} \rho_j(t_j(p'))}}.$$

We split the denominator into the term for agent $i$ and all other terms. It is then not hard to see that this is a
non-decreasing function of $b_i$.
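The monotonicity claim can be checked numerically on a grid of bids. The sketch below assumes that the exploitation probability of agent $i$ is proportional to $(1+\epsilon)^{b_i c_i / v_{\max}}$, where $c_i$ counts the clicks agent $i$ received in her exploration rounds of earlier phases and $v_{\max}$ is the largest bid; the function name `exploit_prob` and the numbers are hypothetical.

```python
from typing import Sequence


def exploit_prob(i: int, bids: Sequence[float], past_clicks: Sequence[int], eps: float) -> float:
    """Probability of picking agent i in an exploitation round.

    past_clicks[j] is the number of clicks agent j received in her exploration
    rounds of earlier phases; v_max is recomputed from the bids.
    """
    v_max = max(bids)
    weights = [(1.0 + eps) ** (b * c / v_max) for b, c in zip(bids, past_clicks)]
    return weights[i] / sum(weights)


# Vary the bid of agent 0 from 0.1 to 5.0, crossing the largest other bid (2.0).
grid = [0.1 * s for s in range(1, 51)]
probs = [exploit_prob(0, [x, 1.0, 2.0], [3, 1, 4], eps=0.1) for x in grid]
print(all(b >= a - 1e-12 for a, b in zip(probs, probs[1:])))
```

Below the largest competing bid, raising $b_i$ increases agent $i$'s own weight; above it, her weight stays fixed while the rescaling by $v_{\max} = b_i$ shrinks everyone else's, so the probability never decreases in either regime.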

We state the above results in the form of the following corollary. 

Corollary E.2. There exists a weakly truthful normalized mechanism for the adversarial MAB problem
(against an oblivious adversary) whose regret grows as $O((k \log k)^{1/3} \cdot T^{2/3} \cdot v_{\max})$.

Appendix F: Truthfulness in expectation over CTRs



We consider the stochastic MAB mechanism design problem under a more relaxed notion of truthfulness:
truthfulness in expectation, where for each vector of CTRs the expectation is taken over clicks (and
the internal randomness in the mechanism, if the latter is not deterministic). We show that any allocation
$\mathcal{A}^*$ that is monotone in expectation (see footnote 16) can be converted to a mechanism that is truthful in expectation and
monotone in expectation, with minor changes and a very minor increase in regret. Furthermore, we show
that there exist MAB allocations that are monotone in expectation whose regret matches the optimal upper
bounds for MAB algorithms. The conclusion is that in order to obtain any non-trivial lower bounds on
regret and (essentially) any non-trivial structural results, one needs to assume that a mechanism is ex-post
normalized, at least in some approximate sense.

The main result of this section is that for any allocation $\mathcal{A}^*$ that is monotone in expectation, any time
horizon $T$, and any parameter $\gamma \in (0, 1)$ there exists a mechanism $(\mathcal{A}, \mathcal{P})$ such that the mechanism is
truthful in expectation and normalized in expectation, and allocation $\mathcal{A}$ initially makes a random choice
between $\mathcal{A}^*$ and some other allocation, choosing $\mathcal{A}^*$ with probability at least $\gamma$. We call such an allocation $\mathcal{A}$
a $\gamma$-approximation of $\mathcal{A}^*$. Clearly, on any problem instance we have $R_{\mathcal{A}}(T) \le \gamma\, R_{\mathcal{A}^*}(T) + (1 - \gamma)T$. The
extra additive term of $(1 - \gamma)T$ is not significant if e.g. $\gamma = 1 - \tfrac{1}{T}$. The problem with this mechanism
is that it is not ex-post normalized; moreover, in some realizations payments may be very large in absolute
value.

Theorem F.1. Consider the stochastic MAB mechanism design problem with $k$ agents and a fixed time
horizon $T$. For each $\gamma \in (0, 1)$ and each allocation rule $\mathcal{A}^*$ that is monotone in expectation, there exists
a mechanism $(\mathcal{A}, \mathcal{P})$ such that $\mathcal{A}$ is a $\gamma$-approximation of $\mathcal{A}^*$, and the mechanism is truthful in expectation
and normalized in expectation.

Remark. Payment rule $\mathcal{P}$ is well-defined as a mapping from histories to numbers. We do not make any
claims on the efficient computability thereof.

For the sake of completeness, we provide a concrete algorithm which one could plug into Theorem F.1
and obtain improved (and in fact, best possible) regret guarantees.

Proposition F.2. Consider the stochastic MAB mechanism design problem with $k$ agents and a fixed time
horizon $T$. There exists an allocation rule $\mathcal{A}$ that is monotone in expectation, whose regret is $R(T; v_{\max}) =
O(v_{\max} \sqrt{kT \log T})$ in the worst case, and $R_\delta(T; v_{\max}) = O(v_{\max}\, \tfrac{k}{\delta} \log T)$ on the $\delta$-gap instances.

Proof Sketch. For simplicity, assume $v_{\max} = 1$. Let $r_t = \sqrt{8 \log(T)/t}$. Consider the following simple
allocation. Initially, each agent is active. In each phase, play each active agent once, in a round-robin fashion.
After phase $t$, (permanently) de-activate each agent whose sample product (sample average times the bid)
is more than $r_t$ below that of some other active agent. This completes the description of the allocation.

This allocation is based on a well-known (perhaps folklore) MAB algorithm. The regret bounds are
proved along the lines of those in [6]. The crucial observations are that with a very high probability the
optimal agent is never de-activated, and that each sub-optimal agent $i$ is played at most $O(\Delta_i^{-2} \log T)$
times, where $\Delta_i$ is the difference between her product (CTR times the bid) and the maximal one.

The allocation is monotone in expectation because increasing the bid of a given agent cannot cause this 
agent to be de-activated later. □ 
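A minimal simulation of this elimination-style allocation is sketched below. The threshold is an assumption: after phase $t$ an agent is de-activated when her sample product is more than $\sqrt{8\log(T)/t}$ below the best active one. The function name `elimination_allocation` and the example instance are hypothetical, and bids are assumed to be at most 1.

```python
import math
import random
from typing import List, Sequence


def elimination_allocation(bids: Sequence[float], ctrs: Sequence[float],
                           T: int, rng: random.Random) -> List[int]:
    """Simulate T rounds and return how many times each agent was played.

    ctrs are used only to simulate clicks; the allocation itself sees just the
    bids and the observed clicks.
    """
    k = len(bids)
    active = list(range(k))
    plays = [0] * k
    clicks = [0] * k
    t_round, phase = 0, 0
    while t_round < T:
        phase += 1
        for i in active:                      # one round-robin pass over active agents
            if t_round >= T:
                break
            plays[i] += 1
            clicks[i] += 1 if rng.random() < ctrs[i] else 0
            t_round += 1
        if t_round >= T:
            break
        radius = math.sqrt(8.0 * math.log(T) / phase)
        product = {i: bids[i] * clicks[i] / plays[i] for i in active}
        best = max(product.values())
        # Keep an agent unless she is more than `radius` below the best active one.
        active = [i for i in active if product[i] >= best - radius]
    return plays


plays = elimination_allocation([1.0, 1.0], [1.0, 0.0], T=1000, rng=random.Random(0))
print(plays)
```

With click probabilities 1 and 0 the run is deterministic: the second agent survives exactly as long as $\sqrt{8\log(T)/t} \ge 1$, that is for the first 55 phases, and is removed after phase 56.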

16 Monotonicity in expectation is defined in an obvious way: an allocation is monotone in expectation if for each agent $i$ and fixed
bid profile $b_{-i}$, the corresponding expected click-allocation is a non-decreasing function of $b_i$; here the expectation is taken over
the clicks and possibly the allocation's random seed.

F.1 Proof of Theorem F.1



Let $\mathcal{A}_{\mathrm{expl}}$ be the allocation rule where in each round an agent is chosen independently and uniformly at
random. Allocation $\mathcal{A}$ is defined as follows: use $\mathcal{A}^*$ with probability $\gamma$; otherwise use $\mathcal{A}_{\mathrm{expl}}$. Fix an instance
$(b, \mu)$ of the stochastic MAB mechanism design problem, where $b = (b_1, \ldots, b_k)$ and $\mu = (\mu_1, \ldots, \mu_k)$
are vectors of bids and CTRs, respectively. Let $C_i = C_i(b_i; b_{-i})$ be the expected number of clicks for agent
$i$ under the original allocation $\mathcal{A}^*$. Then by Myerson [33] the expected payment of agent $i$ must be

$$\mathcal{P}_i^M = b_i\, C_i(b_i; b_{-i}) - \int_0^{b_i} C_i(x; b_{-i})\, dx. \tag{F.1}$$

The key idea is to treat the expected payment as a multivariate polynomial over $\mu_1, \ldots, \mu_k$. It is essential
(given the way we define $\mathcal{P}$) to show that this polynomial has degree $\le T$.



Claim F.3. $\mathcal{P}_i^M$ is a polynomial of degree $\le T$ in variables $\mu_1, \ldots, \mu_k$. 

Proof. Fix the bid profile. Let $X_t$ be the allocation of algorithm $\mathcal{A}^*$ in round $t$. Let poly($T$) be the set of all polynomials 
over $\mu_1, \ldots, \mu_k$ of degree at most $T$. Consider a fixed history $h = (x_1, y_1; \ldots; x_T, y_T)$, and let $h_t$ be the 
corresponding history up to (and including) round $t$. Then 

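$$\Pr[h] = \prod_{t=1}^{T} \Pr[X_t = x_t \mid h_{t-1}]\; \mu_{x_t}^{y_t}\, (1 - \mu_{x_t})^{1 - y_t} \in \text{poly}(T), \tag{F.2}$$
$$C_i(b_i; b_{-i}) = \sum_{h \in \mathcal{H}} \Pr[h]\; \#\text{clicks}_i(h) \in \text{poly}(T). \tag{F.3}$$

Therefore $\mathcal{P}_i^M \in \text{poly}(T)$, since one can take the integral in (F.1) separately over the coefficient of each 
monomial of $C_i(x; b_{-i})$. □ 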
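The argument behind (F.2) and (F.3) can be checked by brute force on a tiny instance. The sketch below is an illustration under stated assumptions: polynomials are stored as dictionaries from exponent tuples to integer coefficients, agents are indexed from 0, and the deterministic rule `play_zero_until_miss` is a hypothetical stand-in for an allocation, not the allocation studied in this paper.

```python
from itertools import product


def times_click_factor(poly, agent, clicked):
    """Multiply poly by mu_agent if clicked, otherwise by (1 - mu_agent)."""
    out = {}
    for expo, coef in poly.items():
        bumped = tuple(e + (j == agent) for j, e in enumerate(expo))
        out[bumped] = out.get(bumped, 0) + (coef if clicked else -coef)
        if not clicked:
            out[expo] = out.get(expo, 0) + coef
    return out


def expected_clicks_poly(allocate, agent, k, T):
    """Expected clicks of `agent` as a polynomial in the CTRs, as in (F.3)."""
    total = {}
    # A deterministic rule makes Pr[X_t = x_t | h_{t-1}] equal to 1 along the
    # history it generates, so enumerating click sequences covers all histories.
    for clicks in product((0, 1), repeat=T):
        history, prob = [], {(0,) * k: 1}
        for y in clicks:
            x = allocate(history)
            prob = times_click_factor(prob, x, y)
            history.append((x, y))
        n_clicks = sum(y for x, y in history if x == agent)
        for expo, coef in prob.items():
            total[expo] = total.get(expo, 0) + coef * n_clicks
    return {e: c for e, c in total.items() if c != 0}


def play_zero_until_miss(history):
    # Hypothetical rule: play agent 0 while every past round was clicked.
    return 0 if all(y for _, y in history) else 1


poly_agent0 = expected_clicks_poly(play_zero_until_miss, agent=0, k=2, T=2)
poly_agent1 = expected_clicks_poly(play_zero_until_miss, agent=1, k=2, T=2)
max_degree = max(sum(e) for e in list(poly_agent0) + list(poly_agent1))
```

Here `poly_agent0` encodes $\mu_0 + \mu_0^2$ and `poly_agent1` encodes $\mu_1 - \mu_0 \mu_1$, both of degree at most $T = 2$.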

Fix time horizon $T$. For a given run of an allocation rule, the history is defined as $h = (x_1, y_1; \ldots; x_T, y_T)$, 
where $x_t$ is the allocation in round $t$, and $y_t \in \{0, 1\}$ is the corresponding click. Let $\mathcal{H}$ be the set of all 
possible histories. 

Our payment rule $\mathcal{P}$ is a deterministic function of history. For each agent $i$, we define the payment $\mathcal{P}_i = 
\mathcal{P}_i(h)$ for each history $h$ such that $E_h[\mathcal{P}_i(h)] = \mathcal{P}_i^M$ for any choice of CTRs, and hence $E_h[\mathcal{P}_i(h)] \equiv \mathcal{P}_i^M$, 
where $\equiv$ denotes an equality between polynomials over $\mu_1, \ldots, \mu_k$. 

Fix the bid vector and fix agent $i$. We define the payment $\mathcal{P}_i$ as follows. Charge nothing if allocation 
$\mathcal{A}^*$ is used. If allocation $\mathcal{A}_{\text{expl}}$ is used, charge per monomial. Specifically, let mono($T$) be the set of all 
monomials over $\mu_1, \ldots, \mu_k$ of degree at most $T$. For each monomial $Q \in \text{mono}(T)$ we define a subset of 
relevant histories $\mathcal{H}_i(Q) \subset \mathcal{H}$. (We defer the definition till later in the proof.) For a given history $h \in \mathcal{H}$ 
we charge a (possibly negative) amount 

$$\mathcal{P}_i(h) = \sum_{Q \in \text{mono}(T):\; h \in \mathcal{H}_i(Q)} k^{\deg(Q)}\; \mathcal{P}_i^M(Q), \tag{F.4}$$

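where $\deg(Q)$ is the degree of $Q$, and $\mathcal{P}_i^M(Q)$ is the coefficient of $Q$ in $\mathcal{P}_i^M$. Let $\mathbb{P}_{\text{expl}}$ be the distribution on 
histories induced by $\mathcal{A}_{\text{expl}}$. Then the expected payment is 

$$E_h[\mathcal{P}_i(h)] = \sum_{Q \in \text{mono}(T)} k^{\deg(Q)}\; \mathbb{P}_{\text{expl}}[\mathcal{H}_i(Q)]\; \mathcal{P}_i^M(Q).$$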

Therefore in order to guarantee that $E_h[\mathcal{P}_i(h)] \equiv \mathcal{P}_i^M$ it suffices to choose $\mathcal{H}_i(Q)$ for each $Q$ so that 

$$k^{\deg(Q)}\; \mathbb{P}_{\text{expl}}[\mathcal{H}_i(Q)] \equiv Q. \tag{F.5}$$

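Consider a monomial $Q = \mu_1^{\alpha_1} \cdots \mu_k^{\alpha_k}$. Let $\mathcal{H}_i(Q)$ consist of all histories such that first agent 1 is played 
$\alpha_1$ times in a row, and clicked every time, then agent 2 is played $\alpha_2$ times in a row, and clicked every time, 
and so on till agent $k$. In the remaining $T - \deg(Q)$ rounds, any agent can be chosen, and any outcome 
(click or no click) can be received. It is clear that (F.5) holds: under $\mathcal{A}_{\text{expl}}$, each prescribed round selects the required 
agent $j$ with probability $1/k$ and yields a click with probability $\mu_j$, so $\mathbb{P}_{\text{expl}}[\mathcal{H}_i(Q)] = k^{-\deg(Q)}\, Q$. 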
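Condition (F.5) can also be verified exactly on a small instance by enumerating every history under uniform random exploration. This is a minimal sketch with hypothetical names (`history_prob`, `relevant_prefix`, `prob_relevant`), agents indexed from 0, and arbitrarily chosen rational CTRs; exact arithmetic via `Fraction` avoids rounding issues.

```python
from fractions import Fraction
from itertools import product


def history_prob(history, mu):
    """Probability of a history when each round picks an agent uniformly at random."""
    k = len(mu)
    p = Fraction(1)
    for x, y in history:
        p *= Fraction(1, k) * (mu[x] if y else 1 - mu[x])
    return p


def relevant_prefix(alpha):
    """Agent 0 clicked alpha[0] times in a row, then agent 1 alpha[1] times, and so on."""
    return [(agent, 1) for agent, a in enumerate(alpha) for _ in range(a)]


def prob_relevant(alpha, mu, T):
    """Total probability of all length-T histories that start with the relevant prefix."""
    k = len(mu)
    prefix = relevant_prefix(alpha)
    rounds = list(product(range(k), (0, 1)))  # all (agent, click) pairs
    total = Fraction(0)
    for h in product(rounds, repeat=T):
        if list(h[:len(prefix)]) == prefix:
            total += history_prob(h, mu)
    return total


mu = [Fraction(1, 3), Fraction(3, 4)]  # hypothetical CTRs for k = 2 agents
T = 3
alpha = (2, 1)                         # monomial Q = mu_0^2 * mu_1
lhs = len(mu) ** sum(alpha) * prob_relevant(alpha, mu, T)
rhs = mu[0] ** 2 * mu[1]
```

With these values both `lhs` and `rhs` equal 1/12, matching (F.5) for this monomial. The unconstrained rounds after the prefix contribute total probability 1, which is why they do not affect the identity.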
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